Abstract

Pairwise learning usually refers to a learning task which involves a loss
function depending on pairs of examples; notable instances include
ranking, metric learning and AUC maximization. In this paper,
we study an online algorithm for pairwise learning with a least-square loss
function in an unconstrained setting of a reproducing kernel Hilbert space
(RKHS), which we refer to as the Online Pairwise lEaRning Algorithm
(OPERA). In contrast to existing works [18, 36], which require that the
iterates are restricted to a bounded domain or that the loss function is
strongly convex, OPERA is associated with a non-strongly convex objective
function and learns the target function in an unconstrained RKHS. Specifically,
we establish a general theorem which guarantees the almost sure convergence
of the last iterate of OPERA without any assumptions on the underlying
distribution. Explicit convergence rates are derived under the condition of
polynomially decaying step sizes. We also establish an interesting property
of a family of widely used kernels in the setting of pairwise learning and
illustrate the above convergence results using such kernels. Our methodology
mainly depends on the characterization of RKHSs using their associated
integral operators and probability inequalities for random variables with values
in a Hilbert space.


1 Introduction 

Let the input space $\mathcal{X}$ be a compact domain of $\mathbb{R}^d$ and the output space
$\mathcal{Y} \subseteq \mathbb{R}$. In the standard problems of regression and classification |HE2], one
considers learning from a set of examples $\mathbf{z} = \{z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y} : i = 1, 2, \ldots, T\}$,
with $T \in \mathbb{N}$, drawn independently and identically distributed (i.i.d.) from an unknown
distribution $\rho$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Associated with a specific learning problem, typically a
univariate loss function $\ell(h, x, y)$ is used to measure the quality of a hypothesis function
$h : \mathcal{X} \to \mathcal{Y}$.

This paper is motivated by the growing interest in an important family of
learning problems which, for simplicity, we refer to as pairwise learning problems.
In contrast to classical regression and classification, such learning problems involve
pairwise loss functions, i.e. the loss function depends on a pair of examples and
can be expressed as $\ell(f, (x,y), (x',y'))$ for a hypothesis function $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
Many machine learning tasks can be formulated as pairwise learning problems.
Such tasks include ranking [U EH EGL 171 [23], similarity and metric learning J5[ [8[
HU [351 [39], AUC maximization [02], and gradient learning HH [22]. For instance,
the task of ranking is to learn a ranking function capable of predicting an ordering
of objects according to some attached relevance information. It generally involves
the use of a misranking loss $\ell(f, (x,y), (x',y')) = \mathbb{I}_{\{(y-y')f(x,x') < 0\}}$ or its surrogate
loss $\ell(f, (x,y), (x',y')) = (1 - (y-y')f(x,x'))^2$, where $\mathbb{I}(\cdot)$ is the indicator function.
The goal of ranking is to find a ranking rule $f$ in a hypothesis space $\mathcal{H}$ from the
available data that minimizes the expected misranking risk

$$\mathcal{R}(f) = \iint \ell(f, (x,y), (x',y'))\, d\rho(x,y)\, d\rho(x',y'). \tag{1.1}$$
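To make the two ranking losses concrete, the following is a minimal sketch of their empirical counterparts. The function names, the toy data, and the choice of averaging over all ordered pairs with $i \neq j$ are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def misranking_loss(f_val, y, y_prime):
    """Indicator loss: 1 if f orders the pair against the sign of y - y'."""
    return float((y - y_prime) * f_val < 0)

def squared_surrogate_loss(f_val, y, y_prime):
    """Least-squares surrogate (1 - (y - y') f(x, x'))^2 of the misranking loss."""
    return (1.0 - (y - y_prime) * f_val) ** 2

def empirical_pairwise_risk(f, loss, xs, ys):
    """Average of loss(f(x_i, x_j), y_i, y_j) over all ordered pairs with i != j."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += loss(f(xs[i], xs[j]), ys[i], ys[j])
    return total / (n * (n - 1))

# Hypothetical toy data: labels equal inputs, so f(x, x') = x - x' ranks perfectly.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 1.0, 2.0])
good_rule = lambda x, x_prime: x - x_prime
bad_rule = lambda x, x_prime: x_prime - x
```

On this toy data the rule `good_rule` never misranks a pair, while `bad_rule` misranks every pair; the surrogate loss additionally penalizes predictions whose magnitude deviates from the target scale.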

In this paper, we assume that the hypothesis function $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for pairwise
learning belongs to a reproducing kernel Hilbert space (RKHS) defined on the product
space $\mathcal{X}^2 = \mathcal{X} \times \mathcal{X}$. Specifically, let $K : \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$ be a Mercer kernel, i.e. a
continuous, symmetric and positive semi-definite kernel, see e.g. p3][32]. According
to [2], the RKHS $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the completion of
the linear span of the set of functions $\{K_{(x,x')}(\cdot) := K((x,x'), (\cdot,\cdot)) : (x,x') \in \mathcal{X}^2\}$
with an inner product satisfying the reproducing property, i.e., for any $x, x' \in \mathcal{X}$
and $f \in \mathcal{H}_K$, $\langle K_{(x,x')}, f \rangle_K = f(x,x')$.

Recently, a large amount of work has focused on pairwise learning algorithms in the
batch setting, in the sense that the algorithm uses the training data $\mathbf{z}$ at once. A
general regularization scheme in an RKHS $\mathcal{H}_K$ for pairwise learning can be formulated
as

$$f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{T(T-1)} \sum_{i,j=1,\, i \neq j}^{T} \ell(f, (x_i,y_i), (x_j,y_j)) + \lambda \|f\|_K^2 \Big\}, \tag{1.2}$$

where $\lambda > 0$ is a regularization parameter. The above general formulation was
studied for ranking EUE5] and metric learning 0E] under choices of different
pairwise kernels (see further discussions in Subsection 2.1). Their generalization
analysis was established using the concept of algorithmic stability [1], robustness
0 or U-statistics and U-processes (U 03] [25]. However, there is relatively little
work related to online algorithms for pairwise learning, despite their potential
capability of dealing with large datasets. Only recently did [36] establish the
first generalization analysis of online learning methods for pairwise learning in the
linear case. In particular, they showed that online-to-batch conversion bounds hold
which are similar to those in the univariate loss function case [9].

In this paper, we study an Online Pairwise lEaRning Algorithm (OPERA)
with a least-square loss function in a reproducing kernel Hilbert space (RKHS).
In particular, a general convergence theorem is established which guarantees the
almost sure convergence of the last iterate of OPERA. Explicit convergence rates
are derived under the condition of polynomially decaying step sizes. In contrast to
existing works [18, 36], which require that the iterates are restricted to a bounded
domain or that the loss function is strongly convex, OPERA is associated with a
non-strongly convex objective function and learns the target function in an
unconstrained RKHS (see more discussions in Section 3). Our novel methodology mainly
depends on the characterization of RKHSs using the associated integral operators
and probability inequalities for random variables with values in the Hilbert space
of Hilbert-Schmidt operators.

The paper is organized as follows. Section 2 introduces OPERA and presents the
main results together with particular examples of specific pairwise kernels. Section
3 discusses the related work. Section 4 presents a novel error decomposition for
analyzing OPERA and establishes the associated technical estimates. The main
results are proved in Section 5. The paper concludes in Section 6. The proofs of
technical lemmas are postponed to the Appendix.


2 Main Results 

In this section, we introduce an online pairwise learning algorithm associated with
the least-square loss $\ell(f, (x,y), (x',y')) = (f(x,x') - y + y')^2$ in a reproducing kernel
Hilbert space $\mathcal{H}_K$, and state our main results. In particular, denote the true risk,
for any function $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, by

$$\mathcal{E}(f) = \iint_{\mathcal{Z} \times \mathcal{Z}} (f(x,x') - y + y')^2 \, d\rho(x,y)\, d\rho(x',y').$$

Denote by $\tilde f_\rho$ the minimizer of the functional $\mathcal{E}(\cdot)$ among all measurable functions.
It is easy to see that $\tilde f_\rho$ can be represented by the difference of two standard
regression functions, i.e.

$$\tilde f_\rho(x,x') = \int_{\mathcal{Y}} y \, d\rho(y|x) - \int_{\mathcal{Y}} y \, d\rho(y|x') = f_\rho(x) - f_\rho(x'). \tag{2.1}$$

For this reason, throughout this paper we refer to $\tilde f_\rho$ as the pairwise regression
function. Denote by $L^2_\rho(\mathcal{X}^2)$ the space of square integrable functions on the domain
$\mathcal{X} \times \mathcal{X}$, i.e.

$$L^2_\rho(\mathcal{X}^2) = \Big\{ f : \mathcal{X} \times \mathcal{X} \to \mathbb{R} : \|f\|_\rho = \Big( \iint_{\mathcal{X} \times \mathcal{X}} |f(x,x')|^2 \, d\rho_X(x)\, d\rho_X(x') \Big)^{1/2} < \infty \Big\},$$

where $\rho_X$ is the marginal distribution of $\rho$ over $\mathcal{X}$. Similar to the standard
least-square regression problem (see e.g. [12]), the following property holds true:

$$\mathcal{E}(f) - \mathcal{E}(\tilde f_\rho) = \|f - \tilde f_\rho\|_\rho^2.$$
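For completeness, the excess-risk identity can be checked by expanding the square around the pairwise regression function. The short derivation below is a sketch under the standing assumption that $(x,y)$ and $(x',y')$ are independent draws from $\rho$; the shorthand $\tilde f_\rho(x,x') = f_\rho(x) - f_\rho(x')$ follows (2.1).

```latex
\begin{align*}
\mathcal{E}(f)
 &= \iint \big( f(x,x') - \tilde f_\rho(x,x') + \tilde f_\rho(x,x') - (y - y') \big)^2
    \, d\rho(x,y)\, d\rho(x',y') \\
 &= \| f - \tilde f_\rho \|_\rho^2 + \mathcal{E}(\tilde f_\rho)
    + 2 \iint \big( f - \tilde f_\rho \big)(x,x')\,
      \big( \tilde f_\rho(x,x') - (y - y') \big)\, d\rho(x,y)\, d\rho(x',y').
\end{align*}
% The cross term vanishes: conditionally on (x, x'), the expectation of y - y'
% equals f_\rho(x) - f_\rho(x') = \tilde f_\rho(x,x'), so the inner factor has
% conditional mean zero. Hence
% \mathcal{E}(f) - \mathcal{E}(\tilde f_\rho) = \| f - \tilde f_\rho \|_\rho^2 .
```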

In this paper, we study the following online pairwise learning algorithm which aims 
to learn the pairwise regression function f p from data. 

Definition 1. Given the i.i.d. generated training data $\mathbf{z} = \{z_i = (x_i, y_i) : i =
1, 2, \ldots, T\}$, the Online Pairwise lEaRning Algorithm (OPERA) is given by $f_1 =
f_2 = 0$ and, for $2 \le t \le T$,

$$f_{t+1} = f_t - \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} \big( f_t(x_t, x_j) - y_t + y_j \big) K_{(x_t, x_j)}, \tag{2.2}$$

where $\{\gamma_t > 0 : t \in \mathbb{N}\}$ is usually referred to as the sequence of step sizes.
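The recursion in Definition 1 can be sketched in a few lines by storing each iterate as a finite kernel expansion over the pairs seen so far. This is only an illustrative sketch: the Gaussian kernel on concatenated pairs, the bandwidth `sigma`, the step sizes $\gamma_t = t^{-\theta}/\mu$, and the function names are assumptions made for the example, and the naive storage grows quadratically in $T$.

```python
import numpy as np

def gaussian_pair_kernel(P, Q, sigma=1.0):
    """Gaussian kernel between rows of P (n, 2d) and Q (m, 2d); returns an (n, m) matrix."""
    sq_dist = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / sigma**2)

def opera(xs, ys, theta=0.75, mu=1.0, kernel=gaussian_pair_kernel):
    """Run the update of Definition 1 and return the last iterate f_{T+1} as a callable.

    xs has shape (T, d), ys has shape (T,). The iterate is stored as
    f = sum_k coefs[k] * K(centers[k], .), where each center is a concatenated pair.
    """
    T, d = xs.shape
    centers = np.empty((0, 2 * d))
    coefs = np.empty(0)
    for t in range(2, T + 1):                      # t is 1-based, as in the paper
        x_t, y_t = xs[t - 1], ys[t - 1]
        # Pairs (x_t, x_j) for j = 1, ..., t-1.
        pairs = np.hstack([np.tile(x_t, (t - 1, 1)), xs[: t - 1]])
        f_vals = kernel(pairs, centers) @ coefs    # f_t(x_t, x_j); zeros while f_t = 0
        resid = f_vals - y_t + ys[: t - 1]
        gamma_t = t ** (-theta) / mu
        centers = np.vstack([centers, pairs])
        coefs = np.concatenate([coefs, -gamma_t / (t - 1) * resid])

    def f(x, x_prime):
        query = np.concatenate([x, x_prime])[None, :]
        return float((kernel(query, centers) @ coefs)[0])

    return f
```

For the Gaussian pair kernel used here, $K((x,x'),(x,x')) = 1$, so `mu=1.0` satisfies the step-size condition $\gamma_t \kappa^2 \le 1$ used later in the analysis.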


OPERA is similar to the online projected gradient descent algorithm in [18, 36],
i.e., $f_0 = f_1 = 0$ and, for $1 < t \le T$,

$$f_t = \mathrm{Proj}_{B_R} \Big[ f_{t-1} - \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} \big( f_{t-1}(x_t, x_j) - y_t + y_j \big) K_{(x_t, x_j)} \Big], \tag{2.3}$$

where $\mathrm{Proj}_{B_R}(\cdot)$ denotes the projection onto a prescribed ball $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$
with radius $R$. In contrast, OPERA does not have this additional projection
step and is implemented in the unconstrained setting.
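For comparison, the projection step that OPERA avoids can be sketched as follows. In a Hilbert space, projecting onto a centered ball of radius $R$ is a radial rescaling, and for a finite kernel expansion the RKHS norm is available from the Gram matrix. The helper names and the coefficient/Gram representation are assumptions of this sketch.

```python
import numpy as np

def rkhs_norm(coefs, gram):
    """Norm of f = sum_k coefs[k] K(c_k, .), given gram[k, l] = K(c_k, c_l)."""
    return float(np.sqrt(max(coefs @ gram @ coefs, 0.0)))

def project_to_ball(coefs, gram, radius):
    """Project f onto {||f||_K <= radius} by rescaling its expansion coefficients."""
    norm = rkhs_norm(coefs, gram)
    if norm <= radius:
        return coefs
    return coefs * (radius / norm)
```

The radius is a tuning parameter of the projected method; the unconstrained recursion (2.2) has no counterpart to it.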


The sequence $\{f_t : t = 1, 2, \ldots, T+1\}$ is usually referred to as the learning
sequence generated by OPERA. We call OPERA an online
learning algorithm in the sense that it only needs sequential access to the training
data. Specifically, let $\mathbf{z}^t = \{z_1, z_2, \ldots, z_t\}$; at each time step $t+1$, OPERA
presumes a hypothesis $f_t \in \mathcal{H}_K$, upon which a new example $z_t$ is revealed. The quality
of the pairwise function $f_t$ is estimated by the local empirical error:

$$\mathcal{E}^t(f_t) = \frac{1}{2(t-1)} \sum_{j=1}^{t-1} \big( f_t(x_t, x_j) - y_t + y_j \big)^2. \tag{2.4}$$

The next iterate $f_{t+1}$ given by equation (2.2) is obtained exactly by performing a
gradient descent step from the current iterate $f_t$ based on the gradient of the local
empirical error, which is given by

$$\nabla \mathcal{E}^t(f)\big|_{f = f_t} = \frac{1}{t-1} \sum_{j=1}^{t-1} \big( f_t(x_t, x_j) - y_t + y_j \big) K_{(x_t, x_j)}.$$


Here, $\nabla \mathcal{E}^t(\cdot)$ denotes the functional gradient of the functional $\mathcal{E}^t$ in the RKHS $\mathcal{H}_K$.
Now denote $\kappa := \sup_{x, x' \in \mathcal{X}} \sqrt{K((x,x'), (x,x'))}$, and throughout the paper we
assume that $|y| \le M$ almost surely for some $M > 0$. In addition, we introduce the
notion of K-functional [6] in approximation theory as

$$\mathcal{K}(s, \tilde f_\rho) := \inf_{f \in \mathcal{H}_K} \big\{ \|f - \tilde f_\rho\|_\rho + s \|f\|_K \big\}, \quad s > 0. \tag{2.5}$$

We can establish the following general theorem about the convergence of the last
iterate $f_{T+1}$ generated by OPERA.

Theorem 1. Let $\gamma_t = \frac{1}{\mu} t^{-\theta}$ for any $t \in \mathbb{N}$ with some $\theta \in (\frac{1}{2}, 1)$ and $\mu \ge \kappa^2$, and let
$\{f_t : t = 1, \ldots, T+1\}$ be given by OPERA (2.2). For any $0 < \delta < 1$, we have with
probability $1 - \delta$

II/t+i - Jx < K( Am + k)T-^, 7„) + C»,„ T- "“'("-W) log Tlog(ST/S), 

( 2 . 6 ) 

where $C_{\theta,\kappa}$ depends on $\kappa$ and $\theta$ but is independent of $T$ (see its explicit form in the proof).


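Recall the well-known result (e.g. Pi SO]) that

$$\lim_{s \to 0^+} \mathcal{K}(s, \tilde f_\rho) = \inf_{f \in \mathcal{H}_K} \|f - \tilde f_\rho\|_\rho.$$

Then, assuming $\theta \in (1/2, 1)$ and letting $T \to \infty$ in inequality (2.6), we can prove
the following corollary.

Corollary 1. Let $\gamma_t = \frac{1}{\mu} t^{-\theta}$ for any $t \in \mathbb{N}$ with $\theta \in (\frac{1}{2}, 1)$ and $\mu \ge \kappa^2$, and let
$\{f_t : t = 1, \ldots, T+1\}$ be given by OPERA (2.2). Then, $\|f_{T+1} - \tilde f_\rho\|_\rho$ converges to
$\inf_{f \in \mathcal{H}_K} \|f - \tilde f_\rho\|_\rho$ almost surely.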


Let us discuss the implication of the above corollary. Recall that a kernel is
universal if its associated RKHS is dense in the space of continuous functions on $\mathcal{X} \times \mathcal{X}$
under the uniform norm. Typical examples of universal kernels [20, 32] include the
Gaussian kernel $K((x^1, x^2), (\hat x^1, \hat x^2)) = \exp\big(-\|(x^1, x^2) - (\hat x^1, \hat x^2)\|^2 / \sigma^2\big)$
and the Laplace kernel $K((x^1, x^2), (\hat x^1, \hat x^2)) = \exp\big(-\|(x^1, x^2) - (\hat x^1, \hat x^2)\| / \sigma\big)$,
with bandwidth $\sigma > 0$. In this case, $\inf_{f \in \mathcal{H}_K} \|f - \tilde f_\rho\|_\rho = 0$,
which implies that, as $T \to \infty$, $\|f_{T+1} - \tilde f_\rho\|_\rho \to 0$ almost surely.


We can derive explicit error rates under some regularity assumptions on the
pairwise regression function. The regularity of $\tilde f_\rho$ can typically be measured by the
integral operator $L_K : L^2_\rho(\mathcal{X}^2) \to L^2_\rho(\mathcal{X}^2)$ defined by

$$L_K f = \iint_{\mathcal{X} \times \mathcal{X}} f(x, x') K_{(x, x')} \, d\rho_X(x)\, d\rho_X(x').$$

Since $K$ is a Mercer kernel, $L_K$ is compact and positive. Therefore, the fractional
power operator $L_K^\beta$ is well-defined for any $\beta > 0$. In particular, we know from
[121 HI] that $L_K^{1/2}(L^2_\rho(\mathcal{X}^2)) = \mathcal{H}_K$.

Theorem 2. Let $\{f_t : t = 1, \ldots, T+1\}$ be given by OPERA (2.2). Suppose

f p G L^(L 2 ) with some (5 > 0 and choose 7 * = if™i 2 ?+ 2 ’ 5 j with some p > k 2 . 
Then, for any 0 < 5 < 1 we have, with probability 1 — <5, that 

II/t +1 - 7jp < logTlog( 8 T/h), (2.7) 

where $C_{\beta,\kappa}$ depends on $\beta$, $\kappa$ and $\mu$ but is independent of $T$ (see the explicit form in
the proof).

The algorithm OPERA depends on selecting an appropriate pairwise kernel for
a given learning task. In the next subsection, we consider a specific class of pairwise
kernels and their associated RKHSs which are induced by a kernel $G : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.

2.1 Examples with specific pairwise kernels

Observe that the pairwise regression function satisfies $\tilde f_\rho(x,x') = f_\rho(x) - f_\rho(x')$, and hence
a natural motivation is to use a pairwise function $f(x,x') = g(x) - g(x')$ to
approximate the desired function $\tilde f_\rho$, where $g \in \mathcal{H}_G$ with $G : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ being a
kernel.

Indeed, we can introduce a specific pairwise kernel $K$ such that any function
$f \in \mathcal{H}_K$ can be represented as $f(x,x') = g(x) - g(x')$ with $g \in \mathcal{H}_G$. Specifically,
given the univariate kernel $G$, let the pairwise function $K : \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$ be defined,
for any $x^1, x^2, \hat x^1, \hat x^2 \in \mathcal{X}$, by

$$K((x^1, x^2), (\hat x^1, \hat x^2)) = G(x^1, \hat x^1) + G(x^2, \hat x^2) - G(x^1, \hat x^2) - G(x^2, \hat x^1) = \langle G_{x^1} - G_{x^2},\, G_{\hat x^1} - G_{\hat x^2} \rangle_G. \tag{2.8}$$

It can be easily verified that $K$ defined by (2.8) is positive semi-definite on $\mathcal{X}^2 \times \mathcal{X}^2$,
and thus $K$ is a (pairwise) Mercer kernel on $\mathcal{X} \times \mathcal{X}$ if $G$ is a Mercer kernel on $\mathcal{X}$. The
following proposition characterizes the relationship between $\mathcal{H}_K$ and the original
RKHS $\mathcal{H}_G$.
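The construction of a pairwise kernel from a univariate kernel $G$ can be checked numerically. The sketch below assumes a Gaussian $G$ with bandwidth `sigma` (an illustrative choice) and builds the Gram matrix of the induced pairwise kernel, whose eigenvalues should be nonnegative up to rounding error. All function names are hypothetical.

```python
import numpy as np

def gaussian_G(x, u, sigma=1.0):
    """Univariate Gaussian kernel G(x, u) = exp(-||x - u||^2 / sigma^2)."""
    return float(np.exp(-np.sum((x - u) ** 2) / sigma**2))

def pairwise_K(G, x1, x2, u1, u2):
    """Pairwise kernel K((x1, x2), (u1, u2)) = <G_x1 - G_x2, G_u1 - G_u2>_G."""
    return G(x1, u1) + G(x2, u2) - G(x1, u2) - G(x2, u1)

def pairwise_gram(G, pairs):
    """Gram matrix of the pairwise kernel over a list of pairs (a, b)."""
    n = len(pairs)
    return np.array([[pairwise_K(G, *pairs[a], *pairs[b]) for b in range(n)]
                     for a in range(n)])
```

Two structural facts follow directly from the formula: the kernel vanishes whenever one of its pair arguments has identical entries, and swapping the two entries of one pair flips its sign. Both mirror the antisymmetry of $f(x,x') = g(x) - g(x')$.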

Proposition 1. Let $G : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel and let its associated pairwise
kernel $K$ be induced by (2.8). Then, the following statements hold true:

(a) Assume the constant function $1_{\mathcal{X}} \in \mathcal{H}_G$, let $Z_G = \mathrm{span}\{1_{\mathcal{X}}\}$ be the subspace
containing all constant functions, and let $Z_G^\perp = \{g \in \mathcal{H}_G : \langle g, 1_{\mathcal{X}} \rangle_G = 0\}$ be the subspace
orthogonal to $Z_G$. Then, the mapping $\Im : Z_G^\perp \to \mathcal{H}_K$ defined by $\Im(g)(x^1, x^2) =
g(x^1) - g(x^2)$ is a bijection with the property $\|\Im(g)\|_K = \|g\|_G$.

(b) If the constant function $1_{\mathcal{X}} \notin \mathcal{H}_G$, then the mapping $\Im : \mathcal{H}_G \to \mathcal{H}_K$ defined by
$\Im(g)(x^1, x^2) = g(x^1) - g(x^2)$ is a bijection with the property $\|\Im(g)\|_K = \|g\|_G$.

Part (b) uses the assumption $1_{\mathcal{X}} \notin \mathcal{H}_G$. Various kernels induce RKHSs satisfying
this assumption. For instance, the homogeneous linear kernel $G(x,x') = x^\top x'$
and the Gaussian kernel $G(x,x') = \exp(-\|x - x'\|^2 / \sigma^2)$ [33] are such kernels. However,
in general the assumption in part (b) is not true, and then only part (a) holds.

From the above proposition, we can rewrite OPERA (2.2) as $g_1 = g_2 = 0$ and,
for $2 \le t \le T$,

$$g_{t+1} = g_t - \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} \big( g_t(x_t) - g_t(x_j) - y_t + y_j \big) \big( G_{x_t} - G_{x_j} \big). \tag{2.9}$$

The learning sequence $\{f_t : t = 1, 2, \ldots, T+1\}$ of OPERA can be recovered by

$$f_t(x^1, x^2) = \Im(g_t)(x^1, x^2) := g_t(x^1) - g_t(x^2), \quad \forall x^1, x^2 \in \mathcal{X}. \tag{2.10}$$
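When the pairwise kernel is induced by a univariate kernel $G$, the iterate $g_t$ lives in the span of $G_{x_1}, \ldots, G_{x_{t-1}}$, so one coefficient per sample suffices. The following is a minimal sketch of that recursion, assuming a Gaussian $G$ with bandwidth `sigma` and step sizes $\gamma_t = t^{-\theta}/\mu$; the names and defaults are illustrative. For this choice of $G$ the induced pairwise kernel satisfies $K((x,x'),(x,x')) = 2 - 2G(x,x') \le 2$, which is why `mu=2.0` is used as a default.

```python
import numpy as np

def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of G(x, u) = exp(-||x - u||^2 / sigma^2) over rows of xs."""
    sq = ((xs[:, None, :] - xs[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma**2)

def opera_univariate(xs, ys, theta=0.75, mu=2.0, sigma=1.0):
    """Return alpha with g_{T+1} = sum_i alpha[i] * G(x_i, .)."""
    T = len(xs)
    gram = gaussian_gram(xs, sigma)
    alpha = np.zeros(T)
    for t in range(2, T + 1):                 # 1-based t, as in the paper
        i = t - 1                             # 0-based index of x_t
        g_vals = gram[:t] @ alpha             # g_t(x_1), ..., g_t(x_t)
        resid = g_vals[i] - g_vals[:i] - ys[i] + ys[:i]
        step = t ** (-theta) / mu / (t - 1)
        alpha[i] -= step * resid.sum()        # coefficient of G_{x_t}
        alpha[:i] += step * resid             # coefficients of G_{x_j}, j < t
    return alpha
```

The pairwise prediction is then $f_{T+1}(x^1, x^2) = g_{T+1}(x^1) - g_{T+1}(x^2)$. Each update adds weights that sum to zero, so the coefficients of $g_t$ always sum to zero; storage is linear in $T$ rather than quadratic.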


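Denote

$$L^2_\rho(\mathcal{X}) = \Big\{ f : \mathcal{X} \to \mathbb{R} : \|f\|_\rho = \Big( \int_{\mathcal{X}} |f(x)|^2 \, d\rho_X(x) \Big)^{1/2} < \infty \Big\},$$

and, by applying Proposition 1, we can see that the K-functional $\mathcal{K}$ defined by
(2.5) is reduced to

$$\mathcal{K}_G(s, \tilde f_\rho) = \begin{cases} \inf_{g \in Z_G^\perp} \big\{ \|\Im(g) - \tilde f_\rho\|_\rho + s \|g\|_G \big\}, & \text{if } 1_{\mathcal{X}} \in \mathcal{H}_G, \\ \inf_{g \in \mathcal{H}_G} \big\{ \|\Im(g) - \tilde f_\rho\|_\rho + s \|g\|_G \big\}, & \text{otherwise.} \end{cases} \tag{2.11}$$

Equipped with the above notation, we can obtain the following theorem.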

Theorem 3. Let $\gamma_t = \frac{1}{\mu} t^{-\theta}$ for any $t \in \mathbb{N}$ with $\theta \in (\frac{1}{2}, 1)$ and let $\{g_t : t = 1, 2, \ldots, T+1\}$
be given by algorithm (2.9). Then, the following statements hold true.


(a) Let the K-functional associated with $\mathcal{H}_G$ be defined by (2.11). Then, for any
$0 < \delta < 1$ we have, with probability $1 - \delta$, that

||0(<7r+i) - J P || p < K g {V6 k(1 + k)T~^-J p ) + C d)K T~ m In T. (2.12) 


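(b) Suppose $1_{\mathcal{X}} \notin \mathcal{H}_G$ and $f_\rho \in L_G^\beta(L^2_\rho(\mathcal{X}))$ with some $0 < \beta \le 1/2$, and choose
$\gamma_t = \frac{1}{\mu} t^{-\frac{2\beta+1}{2\beta+2}}$. Then, for any $0 < \delta < 1$ we have, with probability $1 - \delta$, that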

||3(0t+i) - fp\\p] < C 0 , k T-^T 2 In T. (2.13) 


The above theorem implies the following result. Suppose that the original
univariate kernel $G$ in (2.8) is a Gaussian kernel. Choosing $\gamma_t = \frac{1}{\mu} t^{-\theta}$ with
$\theta \in (1/2, 1)$ in (2.9), by an argument similar to the proof of Corollary 1 we
have $\|\Im(g_{T+1}) - \tilde f_\rho\|_\rho \to 0$ almost surely as $T \to \infty$. It remains an open question to us
whether the assumption $1_{\mathcal{X}} \notin \mathcal{H}_G$ in part (b) of the above theorem can be removed.


3 Related work and Discussions


In this section, we discuss the related work on pairwise learning in the batch setting 
and stochastic online learning algorithms in the univariate case. 

Firstly, we briefly review existing work on pairwise learning, most of which
addresses the batch setting. In [25], the generalization analysis for the general
formulation (1.2) was conducted using empirical processes and U-statistics (see
discussions in Example 3 there). Specifically, the author proved
generalization bounds for the excess risk of such estimators with rates faster than $O(1/\sqrt{T})$,
where $T$ is the sample size. In Section 5.2 of [1], the following regularization
formulation was studied for ranking:

f 2 T A 1 

min | T ( T -i) S ~ Vj) + if IMIgJ> (3- 1 ) 

i<j 

where $\mathcal{H}_G$ denotes the RKHS on $\mathcal{X}$ with norm $\|\cdot\|_G$ and $\phi$ is a ranking
loss function (see Definition 1 there). This formulation can be regarded as a special
case of the general framework (1.2) since, by Proposition 1, one can choose
$K((x^1, x^2), (\hat x^1, \hat x^2)) = \langle G_{x^1} - G_{x^2}, G_{\hat x^1} - G_{\hat x^2} \rangle_G$, and then, for any $f \in \mathcal{H}_K$, there
exists a $g \in \mathcal{H}_G$ such that $f(x_i, x_j) = g(x_i) - g(x_j)$ with the property $\|f\|_K = \|g\|_G$. In
contrast to the batch setting, there is relatively little work on online algorithms for
pairwise learning. Most recently, in [36] and [18], online-to-batch conversion bounds
were established for pairwise learning, which share the same spirit as [9] in
the univariate case. Specifically, Kar et al. [18] proved the following result.

Theorem A ^ Let /i, / 2 ,..., fr-i be an ensemble of hypotheses from the space 
K generated by an online learning algorithm with a B-bounded loss function £ : 
K x Z x Z —> [0,5] that guarantees a regret bound o/Jhr., he. 

T i 1-1 T t- 1 

^-—^2£(ft-l, z t, z r)< mf^2-—J2 £ (f^ Z t,Z T ) + ^T- (3.2) 

1=2 1 1 r=l i H 1=2 1 1 t—1 


Then, for any 0 < 6 < 1, we have with probability 1 — 5, 


^ t a T s ? 

— r E ^ + —r E ° «) + vTr + 6B ’ 


T-1 


1=2 


1=2 


lQ gf 

T-1’ 


where, for any f G K, £t(f) = JJ Z z £(f, z, z')dp(z)dp(z'), and the Rademacher 

averages 7 Zt(£ ° K) is defined as 1Z t -i(£ o K) — E[sup fteW ^ X^r=i £ r£(h, z, z T )\ 
with the expectation being over e T , z, and z T . 

1 The authors mainly focused on the linear case. However, the results there can be easily
extended to the kernelized case.

For a fair comparison with our results, let the loss function be $\ell(f, (x,y), (x',y')) =
(f(x,x') - y + y')^2$ and the hypothesis space $\mathcal{H}$ be a bounded ball in an RKHS $\mathcal{H}_K$,
i.e. $\mathcal{H} = B_R := \{f \in \mathcal{H}_K : \|f\|_K \le R\}$ with some $R > 0$. In this case, the constant
$B$ in Theorem A is given by $B = (2M + \kappa R)^2$, and $\ell(\cdot, z, z')$ is Lipschitz continuous
with constant $L = 2M + \kappa R$. By standard techniques to estimate the Rademacher
averages, we can have o R) < with R sufficiently large. Then, using 

an argument similar to the Section 5.3 of [351 we know that the online projected 
gradient descent algorithm (12.3p enjoys the regret bound < (2 M + kR)R\/T. 
Putting this regret bound with the above estimation for the Rademacher averages 
together, from Theorem A we get, with probability 1 — 5, that 

LrEL£(/<) - „“/(/) < EL jt + % + «V log P( T - 1 0 

Let f T = Y^t =2 ft an d then we have S{f T ) < Consequently, 

S(f T ) — inf < o(r 2 \J —This estimation combined with the fact, for 
any /, that £(f) - £{J p ) = \\f - f p \\ 2 implies that 

S / - *16 + o(«Viog(l)/r). 

The first term on the righthand side of the above inequality is known as ap¬ 
proximation error. Suppose the pairwise regression function f p £ l k( l I ) with 
some 0 < /3 < 1/2. Then, we know from [lij [29j that inf||/|| K <R ||/ — f p \\ 2 < 

R 1 ^ \\Ljf f p \\p ~ 2fi , which implies that ||/ T - < O^R“^||L^ /3 / p || p 1 “ 2 ' 3 + 

R 2 y^log^j)/T^. Choosing i? = T — implies, with probability 1 — 5, that 

H7t - m < o(r^(logy/f/5+\\Ljff p \\J^)y (3.3) 

From Theorem [21 for 0 < (3 < 1/2 the last iterate of OPERA has the convergence 
rate: 

II h - h II? < ° (r - ™(log T log(8T/,5)) 2 ). (3.4) 

Comparing the rates in (3.3) and (3.4), we can see that our rate (3.4) for the last
iterate of OPERA is suboptimal compared to that of the average of iterates generated by
algorithm (2.3). However, the online projected gradient descent algorithm (2.3)
requires that all iterates are restricted to a prescribed ball with radius $R$, which leads
to the challenging question of how to tune $R$ appropriately according to the real
data at hand. In addition, the analysis techniques of [18, 36] critically depend on the
bounded-domain assumption and do not directly apply to the unconstrained setting
here. OPERA is performed in the unconstrained setting and hence is parameter-free
except for the choice of step sizes. Indeed, the theorems in Section 2 show that choosing
$\gamma_t = O(t^{-\theta})$ with $1/2 < \theta < 1$ always guarantees that the last iterate of OPERA
converges almost surely without additional assumptions on the underlying
distribution $\rho$.

Secondly, we discuss the related work on (stochastic) online learning algorithms
in the univariate case. There is a large amount of work on (stochastic) online learning
algorithms in the univariate case (SEE Ell ESI EQl SI] or under the more general name
of stochastic approximation (3] ESI [ 2 j[. The main idea is to use a randomized
gradient to replace the gradient of the empirical loss, where the original idea dates
back to the work [26] in the 1950s. Most approaches in stochastic approximation
assume that the hypothesis space is finite-dimensional and the gradient is bounded. In
fact, when the hypothesis space is finite-dimensional, a simple averaging scheme
for stochastic gradient descent [3] can achieve the optimal rate $O(1/T)$ under the
assumption that the covariance operator $\int_{\mathcal{X}} x x^\top d\rho_X(x)$ is invertible. Stochastic
online learning with a least-square loss in an infinite-dimensional RKHS was
pioneered by [28] and the results were established for general loss functions by [47],
in which the objective functions are all strongly convex.

OPERA (2.2) shares a similar idea with the above algorithms in the univariate
case in the sense that, at each iteration, it uses a computationally cheap gradient
estimator to replace the true gradient. However, the objective function of OPERA
is not strongly convex and the hypothesis space $\mathcal{H}_K$ is not bounded. In particular,
OPERA is closer to the online algorithm in [40], where the authors studied
the following stochastic gradient descent in an RKHS $\mathcal{H}_G$:

$$g_1 = 0 \quad \text{and} \quad g_{t+1} = g_t - \gamma_t \big( g_t(x_t) - y_t \big) G_{x_t}, \quad \forall t \in \{1, 2, \ldots, T\}.$$

The analysis in [40] heavily depends on the fact that the randomized gradient
$(g_t(x_t) - y_t) G_{x_t}$ is, conditionally on $\{z_1, z_2, \ldots, z_{t-1}\}$, an unbiased estimator of
the true gradient $\iint_{\mathcal{X} \times \mathcal{Y}} (g_t(x) - y) G_x \, d\rho(x,y)$. However, the randomized gradient
$\frac{1}{t-1} \sum_{j=1}^{t-1} (f_t(x_t, x_j) - y_t + y_j) K_{(x_t, x_j)}$ in OPERA (2.2) is not an unbiased estimator
of the true gradient $\iint (f_t(x,x') - y + y') K_{(x,x')} \, d\rho(x,y)\, d\rho(x',y')$, even
conditionally on $\{z_1, z_2, \ldots, z_{t-1}\}$. This introduces the main difficulty in analyzing its
convergence. Our new methodology relies on the novel error decomposition
presented in the next section. This enables us to overcome this difficulty by
further employing the characterization of RKHSs using the associated integral
operators and probability inequalities for random variables with values in the Hilbert
space of Hilbert-Schmidt operators.
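For reference, the univariate stochastic gradient recursion of [40] discussed in this section, $g_{t+1} = g_t - \gamma_t (g_t(x_t) - y_t) G_{x_t}$, can be sketched as follows. The Gaussian kernel, the bandwidth `sigma`, and the step sizes $\gamma_t = t^{-\theta}/\mu$ are assumptions made for illustration only.

```python
import numpy as np

def kernel_sgd(xs, ys, theta=0.75, mu=1.0, sigma=1.0):
    """Return alpha with g_{T+1} = sum_i alpha[i] * G(x_i, .) for a Gaussian G."""
    T = len(xs)
    sq = ((xs[:, None, :] - xs[None, :, :]) ** 2).sum(axis=-1)
    gram = np.exp(-sq / sigma**2)
    alpha = np.zeros(T)
    for t in range(1, T + 1):                 # 1-based t
        i = t - 1
        pred = gram[i, :i] @ alpha[:i]        # g_t(x_t), using samples before x_t
        alpha[i] = -(t ** (-theta) / mu) * (pred - ys[i])
    return alpha
```

Each step touches a single new sample and adds one coefficient, in contrast to the pairwise update, which couples the new sample with all previous ones.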


4 Error Decomposition and Technical Estimates


This section presents an error decomposition for OPERA which is critical
for proving the main results in Section 2.

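To this end, we introduce some necessary notation. For any $1 \le j < t$, denote
the linear operator

$$L_{(x_t, x_j)} = \langle \cdot, K_{(x_t, x_j)} \rangle_K K_{(x_t, x_j)} : \mathcal{H}_K \to \mathcal{H}_K,$$

that is, $L_{(x_t, x_j)}(g) = g(x_t, x_j) K_{(x_t, x_j)}$ for any $g \in \mathcal{H}_K$, and let $\hat L_t = \frac{1}{t-1} \sum_{j=1}^{t-1} L_{(x_t, x_j)}$. In
addition, define

$$S_{(z_t, z_j)} = (y_t - y_j) K_{(x_t, x_j)}, \quad \text{and} \quad \hat S_t = \frac{1}{t-1} \sum_{j=1}^{t-1} S_{(z_t, z_j)}.$$

We also define an auxiliary operator $\tilde L_t = \int_{\mathcal{Z}} \hat L_t \, d\rho(z_t)$, i.e., for any $f \in \mathcal{H}_K$,

$$\tilde L_t f = \frac{1}{t-1} \sum_{\ell=1}^{t-1} \int_{\mathcal{X}} f(x, x_\ell) K_{(x, x_\ell)} \, d\rho_X(x).$$

Similarly, define

$$\tilde S_t = \int_{\mathcal{Z}} \hat S_t \, d\rho(z_t) = \frac{1}{t-1} \sum_{\ell=1}^{t-1} \int_{\mathcal{X}} \big( f_\rho(x) - y_\ell \big) K_{(x, x_\ell)} \, d\rho_X(x).$$

In addition, let

$$\mathcal{A}^t = (\tilde L_t - L_K) f_t - (\tilde S_t - L_K \tilde f_\rho), \qquad \mathcal{B}^t = (\hat L_t - \tilde L_t) f_t - (\hat S_t - \tilde S_t).$$

With this notation, for any $t \ge 2$ we can rewrite equality (2.2) as

$$f_{t+1} = f_t - \gamma_t \big( \hat L_t(f_t) - \hat S_t \big) = (I - \gamma_t L_K) f_t - \gamma_t (\hat L_t - L_K)(f_t) + \gamma_t \hat S_t,$$

and

$$f_{t+1} - \tilde f_\rho = (I - \gamma_t L_K)(f_t - \tilde f_\rho) - \gamma_t (\hat L_t - L_K) f_t + \gamma_t (\hat S_t - L_K \tilde f_\rho) = (I - \gamma_t L_K)(f_t - \tilde f_\rho) - \gamma_t \mathcal{A}^t - \gamma_t \mathcal{B}^t. \tag{4.1}$$

For any $t, j \in \mathbb{N}$ with $j \le t$, denote $\omega_j^t(L_K) = \prod_{\ell=j}^{t} (I - \gamma_\ell L_K)$, and we use the
conventional notation, for any $t \in \mathbb{N}$, $\omega_{t+1}^t(L_K) = I$ and $\sum_{\ell=t+1}^{t} \gamma_\ell = 0$.

Consequently, since $f_2 = 0$, from the above equality we can derive, for any $t \ge 2$, that

$$f_{t+1} - \tilde f_\rho = -\omega_2^t(L_K)(\tilde f_\rho) - \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \mathcal{A}^j - \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \mathcal{B}^j. \tag{4.2}$$
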
J '=2 3=2 
The above error decomposition is similar to the well-known ones used in learning
theory to perform the error analysis of learning algorithms with univariate
loss functions, see e.g. da m E3 eh]. The term $\omega_2^t(L_K)(\tilde f_\rho)$ is deterministic
and is usually referred to as the approximation error, and the other term, i.e.
$\sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \mathcal{A}^j + \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \mathcal{B}^j$, depends on the random samples and
is often called the sample error. Consequently, from the error decomposition (4.2)
we have

$$\|f_{t+1} - \tilde f_\rho\|_\rho \le \|\omega_2^t(L_K)(\tilde f_\rho)\|_\rho + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \mathcal{A}^j \Big\|_\rho + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \mathcal{B}^j \Big\|_\rho. \tag{4.3}$$

In the following subsections we estimate the terms on the right-hand side of
inequality (4.3).

4.1 Estimation of the sample error 

We now turn our attention to estimating the sample error, i.e. the last two terms
on the right-hand side of inequality (4.3). To this end, we first establish some
useful lemmas. The following lemma gives an upper bound on the learning sequence
$\{f_t : t \in \mathbb{N}\}$ in the $\mathcal{H}_K$ norm, which is mainly inspired by a similar estimate in
[ID] for bounding the iterates of the online gradient descent algorithm in the univariate
case.

Lemma 1. Let the learning sequence $\{f_t : t \in \mathbb{N}\}$ be given by OPERA (2.2) and
assume, for any $t \in \mathbb{N}$, that $\gamma_t \kappa^2 \le 1$. Then we have

$$\|f_t\|_K \le 2M \sqrt{\sum_{j=2}^{t-1} \gamma_j}, \quad \forall t \in \mathbb{N}. \tag{4.4}$$


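Proof. For $t = 1$ or $t = 2$, by definition $f_1 = f_2 = 0$, which certainly satisfies (4.4).
It suffices to prove the case of $t \ge 2$ by induction. Recalling equality (2.2) and the
reproducing property, we have

$$\|f_{t+1}\|_K^2 = \|f_t\|_K^2 - \frac{2\gamma_t}{t-1} \sum_{j=1}^{t-1} \big( f_t(x_t, x_j) - y_t + y_j \big) f_t(x_t, x_j) + \frac{\gamma_t^2}{(t-1)^2} \sum_{j,j'=1}^{t-1} \big( f_t(x_t, x_j) - y_t + y_j \big) \big( f_t(x_t, x_{j'}) - y_t + y_{j'} \big) K((x_t, x_j), (x_t, x_{j'}))$$

$$\le \|f_t\|_K^2 + \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} \Big[ \kappa^2 \gamma_t \big( f_t(x_t, x_j) - y_t + y_j \big)^2 - 2 \big( f_t(x_t, x_j) - y_t + y_j \big) f_t(x_t, x_j) \Big].$$

Define a univariate function $F_j$ by $F_j(s) = \kappa^2 \gamma_t (s - y_t + y_j)^2 - 2(s - y_t + y_j)s$. It is
easy to see that $\sup_{s \in \mathbb{R}} F_j(s) = \frac{(y_t - y_j)^2}{2 - \kappa^2 \gamma_t} \le (2M)^2$ since $\gamma_t \kappa^2 \le 1$ and $|y_j| + |y_t| \le 2M$.
Therefore, from the above estimate we get, for $t \ge 2$, that

$$\|f_{t+1}\|_K^2 \le \|f_t\|_K^2 + \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} \sup_{s \in \mathbb{R}} F_j(s) \le \|f_t\|_K^2 + (2M)^2 \gamma_t.$$

Combining the above inequality with the induction assumption that $\|f_t\|_K \le
2M \sqrt{\sum_{j=2}^{t-1} \gamma_j}$ implies the desired result. This completes the proof of the lemma.

□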
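The norm bound of Lemma 1 can be checked numerically on synthetic data. The sketch below assumes the pairwise kernel induced by a Gaussian univariate kernel $G$, for which $\|f_t\|_K^2 = \alpha^\top \mathbf{G} \alpha$ when $f_t(x,x') = g_t(x) - g_t(x')$ and $g_t = \sum_i \alpha_i G_{x_i}$, and for which $\kappa^2 \le 2$, so $\mu = 2$ gives $\gamma_t \kappa^2 \le 1$. The function name, data, and defaults are illustrative assumptions.

```python
import numpy as np

def lemma1_check(xs, ys, M, theta=0.75, mu=2.0, sigma=1.0):
    """Return (norms, bounds): ||f_t||_K and 2M sqrt(sum_{j=2}^{t-1} gamma_j), t = 1..T+1."""
    T = len(xs)
    sq = ((xs[:, None, :] - xs[None, :, :]) ** 2).sum(axis=-1)
    gram = np.exp(-sq / sigma**2)
    gammas = np.arange(1, T + 1, dtype=float) ** (-theta) / mu   # gammas[t-1] = gamma_t
    alpha = np.zeros(T)
    norms, bounds = [], []
    for t in range(1, T + 2):
        norms.append(np.sqrt(max(alpha @ gram @ alpha, 0.0)))
        bounds.append(2 * M * np.sqrt(gammas[1 : t - 1].sum()))  # gamma_2 .. gamma_{t-1}
        if 2 <= t <= T:                                          # compute f_{t+1}
            i = t - 1
            g_vals = gram[:t] @ alpha
            resid = g_vals[i] - g_vals[:i] - ys[i] + ys[:i]
            step = gammas[i] / (t - 1)
            alpha[i] -= step * resid.sum()
            alpha[:i] += step * resid
    return np.array(norms), np.array(bounds)
```

The labels passed in should satisfy $|y| \le M$, matching the boundedness assumption of the paper; in that case every recorded norm should stay below the corresponding bound.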

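Denote the operator norm $\|\omega_j^t(L_K) L_K^\beta\|_{\mathcal{L}(L^2_\rho)} = \sup_{\|f\|_\rho \le 1} \|\omega_j^t(L_K) L_K^\beta f\|_\rho$. The
following technical lemma estimates this operator norm; it is implied by
the proof of Lemma 3 in [40].

Lemma 2. Let $\beta > 0$ and $\gamma_i \kappa^2 \le 1$ for any integer $i \in [j, t]$. Then there holds

$$\|\omega_j^t(L_K) L_K^\beta\|_{\mathcal{L}(L^2_\rho)} \le \Big( \Big(\frac{\beta}{e}\Big)^\beta + \kappa^{2\beta} \Big) \min \Big\{ 1, \Big( \sum_{\ell=j}^{t} \gamma_\ell \Big)^{-\beta} \Big\}.$$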
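A finite-dimensional sanity check of the bound in Lemma 2 is sketched below. It replaces $L_K$ by a positive semi-definite operator given through its eigenvalues in $[0, \kappa^2]$, so that the operator norm of $\prod_\ell (I - \gamma_\ell L) L^\beta$ is the maximum of $\prod_\ell (1 - \gamma_\ell \lambda) \lambda^\beta$ over the eigenvalues $\lambda$. The stand-in spectrum and the function name are assumptions of the sketch.

```python
import numpy as np

def lemma2_sides(eigs, gammas, beta, kappa):
    """Return (lhs, rhs) of the Lemma 2 bound for a PSD operator with eigenvalues eigs.

    gammas holds the step sizes gamma_j, ..., gamma_t and must satisfy gamma * kappa^2 <= 1.
    """
    per_eig_product = np.prod(1.0 - np.outer(gammas, eigs), axis=0)
    lhs = np.max(np.abs(per_eig_product) * eigs**beta)
    total = gammas.sum()
    rhs = ((beta / np.e) ** beta + kappa ** (2 * beta)) * min(1.0, total ** (-beta))
    return lhs, rhs

kappa = 1.5
eigs = np.linspace(0.0, kappa**2, 200)                                # stand-in spectrum
gammas = np.arange(3, 201, dtype=float) ** (-0.75) / kappa**2         # gamma_3 .. gamma_200
```

The bound reflects two regimes: for a short horizon the trivial estimate $\lambda^\beta \le \kappa^{2\beta}$ dominates, while for a long horizon the product decays like $\exp(-\lambda \sum_\ell \gamma_\ell)$, which yields the factor $(\sum_\ell \gamma_\ell)^{-\beta}$.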


The estimation of the sample error also relies on an important characterization
of $\mathcal{H}_K$ by the fractional operator $L_K^{1/2}$ (see Theorem 4 and Remark 3 in [12]).
Specifically, for any $f \in \mathcal{H}_K$ there exists $g \in L^2_\rho(\mathcal{X}^2)$ such that $L_K^{1/2} g = f$ with the
property $\|f\|_K = \|L_K^{1/2} g\|_K = \|g\|_\rho$. With this characterization of $\mathcal{H}_K$, it is easy to
see, for any $j \le t$ and $f \in \mathcal{H}_K$, that

$$\|\omega_{j+1}^t(L_K) f\|_\rho = \|\omega_{j+1}^t(L_K) L_K^{1/2} g\|_\rho \le \|\omega_{j+1}^t(L_K) L_K^{1/2}\|_{\mathcal{L}(L^2_\rho)} \|g\|_\rho = \|\omega_{j+1}^t(L_K) L_K^{1/2}\|_{\mathcal{L}(L^2_\rho)} \|f\|_K. \tag{4.5}$$


We also need the following probabilistic inequalities in a Hilbert space. The
first one is Bennett's inequality for random variables in Hilbert spaces, which
can be easily derived from [28, Theorem B 4].

Lemma 3. Let $\{\xi_i : i = 1, 2, \ldots, t\}$ be independent random variables in a Hilbert
space $\mathcal{H}$ with norm $\|\cdot\|$. Suppose that almost surely $\|\xi_i\| \le B$ and $E\|\xi_i\|^2 \le \sigma^2 < \infty$.
Then, for any $0 < \delta < 1$, the following holds with probability at least $1 - \delta$,


il 

1 1 


t 

- E?.:] 

i =1 




Tog | 


The second probabilistic inequality is the Pinelis-Bernstein inequality [541 Proposition
A.3] for martingale difference sequences in a Hilbert space, which is derived
from [24, Theorem 3.4].

Lemma 4. Let $\{S_k : k \in \mathbb{N}\}$ be a martingale difference sequence in a Hilbert space.
Suppose that almost surely $\|S_k\| \le B$ and $\sum_{k=1}^{t} E[\|S_k\|^2 \mid S_1, \ldots, S_{k-1}] \le \sigma_t^2$. Then,
for any $0 < \delta < 1$, the following holds with probability at least $1 - \delta$,


sup 

i <j<t 




k =1 


< 2 




We also need some facts on Hilbert-Schmidt operators on $\mathcal{H}_K$, see mm-
Specifically, let $HS(\mathcal{H}_K)$ be the Hilbert space of Hilbert-Schmidt operators on $\mathcal{H}_K$
with inner product $\langle A, B \rangle_{HS} = \mathrm{Tr}(B^\top A)$ for any $A, B \in HS(\mathcal{H}_K)$. Here $\mathrm{Tr}$ denotes
the trace of a linear operator. Indeed, the space $HS(\mathcal{H}_K)$ is a subspace of the space
of bounded linear operators on $\mathcal{H}_K$, which is usually denoted by $(\mathcal{L}(\mathcal{H}_K), \|\cdot\|_{\mathcal{L}(\mathcal{H}_K)})$,
with the property, for any $A \in HS(\mathcal{H}_K)$, that

$$\|A\|_{\mathcal{L}(\mathcal{H}_K)} \le \|A\|_{HS}. \tag{4.6}$$


With the above preparations, we are ready to estimate the sample error for al¬ 
gorithm (\2.2h which, according to the error decomposition (14.31) . consists of terms 
II Y^ t j= 2 r Yj u, j+i(LK)A : ’\\p and || Yj =2 || p . Let us start with the estima¬ 

tion of || EUtM + i(l k )A%. 

Theorem 4. Assume 7 t K 2 < 1 for any t E N and let {f t : t G N} be given by 
equation \2.2 1) . For any t > 2 and 0 < 5 < 1, with probability 1 — 5 there holds 


'Yhi ut i+i{ L K)A 3 \\p < 

3=2 


AT Z 

[12«(1 + k) 2 M log —] Y 

3=2 


7,(1 + (E£ 2 S) 1/2 ) 

VJt 1 + EL j+ iT«) 1/2 


Proof. Write 

t t t 

E W + 1 (U<-)^ := y ^jUjj +1 (L K )A{ + YwULMi 

3=2 3=2 3=2 

where A\ = (Lj — L K )fj and A 2 = —(Sj — L K f p ). Hence, 

t t 

< iiE 7^+i( L A'Mill p 

3=2 3=2 

t 

+ 11 Y^ULk)A%. 

3=2 


(4.7) 
For the first term on the right-hand side of inequality (4.7), we have




3 = 3 
t 


3 =3 


<Y.^M+ L + L fWi) Milk 

3 =3 
t 

< ll^j - ^ll£(wjf)ll/tlk 


(4.8) 


i=3 

t 

- n^j - -kiditfsii/jiiftr, 

3=3 

where the second inequality used (14.5ft and the last inequality used (14.6ft . 

Let the vector-valued random variable £(x) = J x (-, K( x ', x )) kK( x ', x )dpx{x'). By 
following the proof of Lemma 2 in [13] . we have that ||(-, K( x i^) k K(x>, x )\\hs < k 2 - 
Hence, ||^||hs < f x ||(-, K {x ,^ x) ) K K (x ,^ x) \\ HS dp x (x') < k 2 . Applying Lemma ED with 
B = a = k 2 and H = HS(Hk), we have, with probability 1 — |, that 


| La — L K \\ HS 


j~ 1 


HS 


£= 1 


(4.9) 


< 2g 3 log f _ J; .2 / 'og x < V2K 3 logf 


— i-i 1 V j-i ~ Vj 
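Applying Lemma 2 with $\beta = 1/2$ implies, for any $2 \le j \le t$, that

\[
\big\|\omega_{j+1}^{t}(L_K)L_K^{1/2}\big\|_{\mathcal{L}(L^2_\rho)} \le \Big(\big(\tfrac{1}{2e}\big)^{1/2} + \kappa\Big)\min\Big\{1, \Big(\sum_{\ell=j+1}^{t}\gamma_\ell\Big)^{-1/2}\Big\} \le \sqrt{2}(1+\kappa)\Big/\Big(1+\sum_{\ell=j+1}^{t}\gamma_\ell\Big)^{1/2}, \tag{4.10}
\]

where we used the conventional notation $\sum_{\ell=t+1}^{t}\gamma_\ell = 0$. Putting estimations (4.9), (4.10) and inequality (4.4) in Lemma 1 back into (4.8), with probability $1-\delta$ there holds

\[
\Big\|\sum_{j=3}^{t}\gamma_j\omega_{j+1}^{t}(L_K)A_j^1\Big\|_\rho \le \Big[12\kappa^2(1+\kappa)M\log\frac{4t}{\delta}\Big]\sum_{j=3}^{t}\frac{\gamma_j\big(\sum_{\ell=2}^{j-1}\gamma_\ell\big)^{1/2}}{\sqrt{j}\,\big(1+\sum_{\ell=j+1}^{t}\gamma_\ell\big)^{1/2}}. \tag{4.11}
\]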


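For the term $\|\sum_{j=2}^{t}\gamma_j\omega_{j+1}^{t}(L_K)A_j^2\|_\rho$, we observe from (4.5) again that

\[
\Big\|\sum_{j=2}^{t}\gamma_j\omega_{j+1}^{t}(L_K)A_j^2\Big\|_\rho \le \sum_{j=2}^{t}\gamma_j\big\|\omega_{j+1}^{t}(L_K)L_K^{1/2}\big\|_{\mathcal{L}(L^2_\rho)}\,\|A_j^2\|_K
= \sum_{j=2}^{t}\gamma_j\big\|\omega_{j+1}^{t}(L_K)L_K^{1/2}\big\|_{\mathcal{L}(L^2_\rho)}\,\|\widetilde{S}_j - L_K\tilde{f}_\rho\|_K. \tag{4.12}
\]

Let the vector-valued random variable $\xi(z) = \int_X (f_\rho(x') - y)K_{(x',x)}\,d\rho_X(x') \in \mathcal{H}_K$. Observe that $\|\xi\|_K \le \int_X |f_\rho(x') - y|\,\|K_{(x',x)}\|_K\,d\rho_X(x') \le 2\kappa M$. Applying Lemma 3 with $B = \sigma = 2\kappa M$ and $\mathcal{H} = \mathcal{H}_K$, we have, with probability $1-\frac{\delta}{2t}$, that

\[
\|\widetilde{S}_j - L_K\tilde{f}_\rho\|_K = \Big\|\frac{1}{j-1}\sum_{\ell=1}^{j-1}\xi(z_\ell) - \mathbb{E}(\xi)\Big\|_K
\le \frac{4\kappa M\log\frac{4t}{\delta}}{j-1} + 2\kappa M\sqrt{\frac{2\log\frac{4t}{\delta}}{j-1}}
\le \frac{6\sqrt{2}\,\kappa M\log\frac{4t}{\delta}}{\sqrt{j}}.
\]

Putting the above estimation and inequality (4.10) into (4.12) implies, with probability $1-\delta$, that

\[
\Big\|\sum_{j=2}^{t}\gamma_j\omega_{j+1}^{t}(L_K)A_j^2\Big\|_\rho \le \Big[12\kappa(1+\kappa)M\log\frac{4t}{\delta}\Big]\sum_{j=2}^{t}\frac{\gamma_j}{\sqrt{j}\,\big(1+\sum_{\ell=j+1}^{t}\gamma_\ell\big)^{1/2}}. \tag{4.13}
\]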

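Combining inequalities (4.11) and (4.13), we have, with probability $1-\delta$, that

\[
\Big\|\sum_{j=2}^{t}\gamma_j\omega_{j+1}^{t}(L_K)A_j\Big\|_\rho \le \Big[12\kappa(1+\kappa)^2 M\log\frac{4t}{\delta}\Big]\sum_{j=2}^{t}\frac{\gamma_j\big(1+(\sum_{\ell=2}^{j-1}\gamma_\ell)^{1/2}\big)}{\sqrt{j}\,\big(1+\sum_{\ell=j+1}^{t}\gamma_\ell\big)^{1/2}}.
\]

This completes the proof of the theorem. □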
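
The last step of (4.10) rests on the elementary bound $\min\{1, s^{-1/2}\} \le \sqrt{2}\,(1+s)^{-1/2}$ for $s \ge 0$. The following Python sketch checks this bound numerically on a grid. The function names and the grid are illustrative choices and are not taken from the paper.

```python
import math


def lhs(s: float) -> float:
    """min{1, s^(-1/2)}, with the convention that s = 0 gives 1."""
    return 1.0 if s <= 0.0 else min(1.0, s ** -0.5)


def rhs(s: float) -> float:
    """sqrt(2) / sqrt(1 + s), the right-hand side used in (4.10)."""
    return math.sqrt(2.0) / math.sqrt(1.0 + s)


def max_violation(grid) -> float:
    """Largest value of lhs - rhs over the grid (non-positive if the bound holds)."""
    return max(lhs(s) - rhs(s) for s in grid)


# Grid of s values in [0, 1000] with spacing 0.01 (an arbitrary choice).
grid = [k / 100.0 for k in range(0, 100001)]
print(max_violation(grid))
```

The bound is tight at $s = 1$, where both sides equal 1, so the printed gap is expected to be zero up to rounding.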


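We move on to the estimation of the term $\|\sum_{j=2}^{t}\gamma_j\omega_{j+1}^{t}(L_K)B_j\|_\rho$.

Theorem 5. Assume $\gamma_t\kappa^2 \le 1$ for any $t \in \mathbb{N}$ and let $\{f_t : t \in \mathbb{N}\}$ be given by equation (2.2). For any $t \ge 2$ and $0 < \delta < 1$, with probability $1-\delta$ there holds

\[
\Big\|\sum_{j=2}^{t}\gamma_j\omega_{j+1}^{t}(L_K)B_j\Big\|_\rho \le \frac{64}{3}\Big(\kappa(1+\kappa)^2 M\log\frac{2}{\delta}\Big)\Big(\sum_{j=2}^{t}\frac{\gamma_j^2\big(1+\sum_{\ell=2}^{j-1}\gamma_\ell\big)}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}\Big)^{1/2}.
\]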


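Proof. Notice, from the recursive equality (2.2), that $f_j$ only depends on samples $\{z_1, \ldots, z_{j-1}\}$ and $f_1 = f_2 = 0$. Therefore, for any $j \ge 2$, there holds

\[
\mathbb{E}\big[\gamma_j\omega_{j+1}^{t}(L_K)B_j \,\big|\, z_1, \ldots, z_{j-1}\big] = 0, \tag{4.14}
\]

which means that $\{\xi_j := \gamma_j\omega_{j+1}^{t}(L_K)B_j : j = 2, \ldots, t\}$ is a martingale difference sequence. In the following, we will apply Lemma 4 to estimate $\|\sum_{j=2}^{t}\gamma_j\omega_{j+1}^{t}(L_K)B_j\|_\rho$. To this end, it remains to estimate $B$ and $\sigma^2$.

Recall that $B_j = (\widehat{L}_j - \widetilde{L}_j)f_j - (\widehat{S}_j - \widetilde{S}_j)$. By (4.6) and Lemma 1, we have

\[
\begin{aligned}
\|B_j\|_K &\le \|\widehat{L}_j - \widetilde{L}_j\|_{\mathcal{L}(\mathcal{H}_K)}\|f_j\|_K + \|\widehat{S}_j - \widetilde{S}_j\|_K \\
&\le \|\widehat{L}_j - \widetilde{L}_j\|_{HS}\|f_j\|_K + \|\widehat{S}_j - \widetilde{S}_j\|_K \\
&\le 2\kappa^2\|f_j\|_K + 2\kappa M \le 4\kappa^2 M\Big(\sum_{\ell=2}^{j-1}\gamma_\ell\Big)^{1/2} + 2\kappa M.
\end{aligned}
\]

Consequently,

\[
\begin{aligned}
\big\|\omega_{j+1}^{t}(L_K)B_j\big\|_\rho &\le \big\|\omega_{j+1}^{t}(L_K)L_K^{1/2}\big\|_{\mathcal{L}(L^2_\rho)}\|B_j\|_K \\
&\le \frac{\sqrt{2}(1+\kappa)}{\big(1+\sum_{\ell=j+1}^{t}\gamma_\ell\big)^{1/2}}\Big(4\kappa^2 M\Big(\sum_{\ell=2}^{j-1}\gamma_\ell\Big)^{1/2} + 2\kappa M\Big) \\
&\le 8\kappa(1+\kappa)^2 M\Big(\frac{1+\sum_{\ell=2}^{j-1}\gamma_\ell}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}\Big)^{1/2},
\end{aligned} \tag{4.15}
\]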

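where the second inequality used (4.10). From the above estimation, we have

\[
\sum_{j=2}^{t}\gamma_j^2\,\mathbb{E}\big(\|\omega_{j+1}^{t}(L_K)B_j\|_\rho^2 \,\big|\, z_1, \ldots, z_{j-1}\big) \le \sigma^2 := 64\kappa^2(1+\kappa)^4 M^2\sum_{j=2}^{t}\frac{\gamma_j^2\big(1+\sum_{\ell=2}^{j-1}\gamma_\ell\big)}{1+\sum_{\ell=j+1}^{t}\gamma_\ell},
\]

and

\[
B = \sup_{2\le j\le t}\gamma_j\big\|\omega_{j+1}^{t}(L_K)B_j\big\|_\rho \le 8\kappa(1+\kappa)^2 M\Big(\sup_{2\le j\le t}\frac{\gamma_j^2\big(1+\sum_{\ell=2}^{j-1}\gamma_\ell\big)}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}\Big)^{1/2}
\le 8\kappa(1+\kappa)^2 M\Big(\sum_{j=2}^{t}\frac{\gamma_j^2\big(1+\sum_{\ell=2}^{j-1}\gamma_\ell\big)}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}\Big)^{1/2}.
\]

Applying Lemma 4 yields that, with probability $1-\delta$,

\[
\Big\|\sum_{j=2}^{t}\gamma_j\omega_{j+1}^{t}(L_K)B_j\Big\|_\rho \le \frac{64}{3}\Big(\kappa(1+\kappa)^2 M\log\frac{2}{\delta}\Big)\Big(\sum_{j=2}^{t}\frac{\gamma_j^2\big(1+\sum_{\ell=2}^{j-1}\gamma_\ell\big)}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}\Big)^{1/2}.
\]

This completes the proof of the theorem. □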

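4.2 Estimates of the approximation error

Here, we establish some basic estimates for the deterministic approximation error involving $\|\omega_2^{t}(L_K)\tilde{f}_\rho\|_\rho$. To this end, we recall the notion of $\mathcal{K}$-functional [6] in approximation theory, namely

\[
\mathcal{K}(s, \tilde{f}_\rho) := \inf_{f\in\mathcal{H}_K}\big\{\|f - \tilde{f}_\rho\|_\rho + s\|f\|_K\big\}, \qquad s > 0. \tag{4.16}
\]

We can estimate the quantity $\|\omega_2^{t}(L_K)\tilde{f}_\rho\|_\rho$ as follows.

Lemma 5. Assume $\gamma_t\kappa^2 \le 1$ for each $t \in \mathbb{N}$. Then the following statements hold true.

(a) Let the $\mathcal{K}$-functional be defined by (2.5). Then, we have

\[
\|\omega_2^{t}(L_K)\tilde{f}_\rho\|_\rho \le \mathcal{K}\Big(\sqrt{2}(1+\kappa)\Big(\sum_{j=2}^{t}\gamma_j\Big)^{-1/2}, \tilde{f}_\rho\Big). \tag{4.17}
\]

(b) If $\tilde{f}_\rho \in L_K^{\beta}(L^2_\rho)$ with some $\beta > 0$ then

\[
\|\omega_2^{t}(L_K)\tilde{f}_\rho\|_\rho \le \Big(\big(\tfrac{\beta}{e}\big)^{\beta} + \kappa^{2\beta}\Big)\Big(\sum_{j=2}^{t}\gamma_j\Big)^{-\beta}\|L_K^{-\beta}\tilde{f}_\rho\|_\rho. \tag{4.18}
\]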


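Proof. Part (a) is proved as follows. For any $f \in \mathcal{H}_K$, from (4.5) we have

\[
\begin{aligned}
\|\omega_2^{t}(L_K)\tilde{f}_\rho\|_\rho &\le \|f - \tilde{f}_\rho\|_\rho + \|\omega_2^{t}(L_K)f\|_\rho \\
&\le \|f - \tilde{f}_\rho\|_\rho + \big\|\omega_2^{t}(L_K)L_K^{1/2}\big\|_{\mathcal{L}(L^2_\rho)}\|f\|_K.
\end{aligned} \tag{4.19}
\]

Applying Lemma 2 with $\beta = \frac{1}{2}$, $j = 2$, implies that $\|\omega_2^{t}(L_K)L_K^{1/2}\|_{\mathcal{L}(L^2_\rho)} \le \sqrt{2}(1+\kappa)\big(\sum_{j=2}^{t}\gamma_j\big)^{-1/2}$. Then, substituting this into the right-hand side of (4.19) yields that

\[
\|\omega_2^{t}(L_K)\tilde{f}_\rho\|_\rho \le \inf_{f\in\mathcal{H}_K}\Big\{\|f - \tilde{f}_\rho\|_\rho + \sqrt{2}(1+\kappa)\Big(\sum_{j=2}^{t}\gamma_j\Big)^{-1/2}\|f\|_K\Big\}. \tag{4.20}
\]

Part (b) can be directly proved by applying Lemma 2 and the following observation

\[
\|\omega_2^{t}(L_K)\tilde{f}_\rho\|_\rho \le \big\|\omega_2^{t}(L_K)L_K^{\beta}\big\|_{\mathcal{L}(L^2_\rho)}\,\|L_K^{-\beta}\tilde{f}_\rho\|_\rho.
\]

□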


5 Proof of Main Results 


In this section, we prove the results presented in Section 2. Let us start with the proofs for Theorems 1 and 2. To this end, we need some technical lemmas.

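Lemma 6. Let $\gamma_j = \frac{j^{-\theta}}{\mu}$ for any $j \in \mathbb{N}$ with $\theta \in (\frac{1}{2}, 1)$ and $\mu > 0$. Then we have, for any $t \ge 4$, that

\[
\sum_{j=2}^{t}\frac{\gamma_j\big(1+(\sum_{\ell=2}^{j-1}\gamma_\ell)^{1/2}\big)}{\sqrt{j}\,\big(1+\sum_{\ell=j+1}^{t}\gamma_\ell\big)^{1/2}} \le C_\theta\, t^{-\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}\log t,
\]

where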

C e = 


26 max|y p,(l—6)) 1 p(l— 
p(l-6»)|36»-2| 

20 max ^ yj p(l— 0))~ 1 , yj p(l— 8)J 

W^e) 


+ 


+ 



if 9^ 2/3 
if 9 = 2/3. 


(5.1) 
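Lemma 7. Let $\gamma_j = \frac{j^{-\theta}}{\mu}$ for any $j \in \mathbb{N}$ with $\theta \in (0, 1)$. Then we have, for any $t \ge 4$, that

\[
\sum_{j=2}^{t}\frac{\gamma_j^2\big(1+\sum_{\ell=2}^{j-1}\gamma_\ell\big)}{1+\sum_{\ell=j+1}^{t}\gamma_\ell} \le \widetilde{C}_\theta\, t^{-\min\{2\theta-1,\,1-\theta\}}\ln t, \tag{5.2}
\]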


where Cg = 


( 5 I 16 max((//l .1 - 0 )) 1 ■/-(1 — 6 >)) \ 1/ '~ -f a / 9/0 
\8fi~ 1 ^(1—6>)|36>—2| J ’ < / t 7 7 tZ /' 3 

^5 | i6max((^(i-e))^- 1 ,^(i-e)) if 0 = 2/3 


The proofs for Lemma 6 and Lemma 7 are given in the Appendix. With the above lemmas, we are ready to establish the main results stated in Section 2.
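
To get a feel for the two sums controlled by Lemma 6 and Lemma 7, they can be evaluated directly for polynomially decaying step sizes $\gamma_j = j^{-\theta}/\mu$. The sketch below is only a numerical illustration; the function names are hypothetical and nothing here replaces the proofs in the Appendix.

```python
import math


def prefix_sums(t: int, theta: float, mu: float):
    """Return (gamma, prefix) with gamma[j] = j^(-theta)/mu for j = 1..t
    and prefix[j] = sum_{l=2}^{j} gamma[l], so that prefix[1] = 0."""
    gamma = [0.0] + [j ** (-theta) / mu for j in range(1, t + 1)]
    prefix = [0.0] * (t + 1)
    for j in range(2, t + 1):
        prefix[j] = prefix[j - 1] + gamma[j]
    return gamma, prefix


def lemma6_sum(t: int, theta: float, mu: float) -> float:
    """Left-hand side of the inequality in Lemma 6."""
    gamma, prefix = prefix_sums(t, theta, mu)
    total = 0.0
    for j in range(2, t + 1):
        head = prefix[j - 1]           # sum_{l=2}^{j-1} gamma_l
        tail = prefix[t] - prefix[j]   # sum_{l=j+1}^{t} gamma_l
        total += gamma[j] * (1.0 + math.sqrt(head)) / (
            math.sqrt(j) * math.sqrt(1.0 + tail))
    return total


def lemma7_sum(t: int, theta: float, mu: float) -> float:
    """Left-hand side of the inequality in Lemma 7."""
    gamma, prefix = prefix_sums(t, theta, mu)
    total = 0.0
    for j in range(2, t + 1):
        head = prefix[j - 1]
        tail = prefix[t] - prefix[j]
        total += gamma[j] ** 2 * (1.0 + head) / (1.0 + tail)
    return total


if __name__ == "__main__":
    theta, mu = 0.75, 1.0
    for t in (10, 100, 1000, 10000):
        print(t, lemma6_sum(t, theta, mu), lemma7_sum(t, theta, mu))
```

Comparing the printed values with $t^{-\min\{\theta-\frac12,\frac{1-\theta}{2}\}}\log t$ and $t^{-\min\{2\theta-1,1-\theta\}}\ln t$ for a few values of $t$ gives an informal sanity check of the stated rates.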


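Proof of Theorem 1. Applying (4.3) with $t = T$, we have

\[
\|f_{T+1} - \tilde{f}_\rho\|_\rho \le \|\omega_2^{T}(L_K)\tilde{f}_\rho\|_\rho + \Big\|\sum_{j=2}^{T}\gamma_j\omega_{j+1}^{T}(L_K)A_j\Big\|_\rho + \Big\|\sum_{j=2}^{T}\gamma_j\omega_{j+1}^{T}(L_K)B_j\Big\|_\rho. \tag{5.3}
\]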


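By Theorem 4 and (5.1), with probability $1-\delta$, there holds

\[
\begin{aligned}
\Big\|\sum_{j=2}^{T}\gamma_j\omega_{j+1}^{T}(L_K)A_j\Big\|_\rho &\le 12\kappa(1+\kappa)^2 M\log\frac{4T}{\delta}\sum_{j=2}^{T}\frac{\gamma_j\big(1+(\sum_{\ell=2}^{j-1}\gamma_\ell)^{1/2}\big)}{\sqrt{j}\,\big(1+\sum_{\ell=j+1}^{T}\gamma_\ell\big)^{1/2}} \\
&\le 12C_\theta\kappa(1+\kappa)^2 M\,T^{-\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}\log T\log\frac{4T}{\delta}.
\end{aligned} \tag{5.4}
\]

From Theorem 5 and (5.2) we have, with probability $1-\delta$, that

\[
\begin{aligned}
\Big\|\sum_{j=2}^{T}\gamma_j\omega_{j+1}^{T}(L_K)B_j\Big\|_\rho &\le \frac{64}{3}\kappa(1+\kappa)^2 M\log\frac{2}{\delta}\Big(\sum_{j=2}^{T}\frac{\gamma_j^2\big(1+\sum_{\ell=2}^{j-1}\gamma_\ell\big)}{1+\sum_{\ell=j+1}^{T}\gamma_\ell}\Big)^{1/2} \\
&\le \frac{64\sqrt{\widetilde{C}_\theta}}{3}\kappa(1+\kappa)^2 M\,T^{-\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}\sqrt{\log T}\,\log\frac{2}{\delta}.
\end{aligned} \tag{5.5}
\]

Putting estimates (5.3), (5.4), and (5.5) together, with probability $1-2\delta$ there holds

\[
\|f_{T+1} - \tilde{f}_\rho\|_\rho \le \|\omega_2^{T}(L_K)\tilde{f}_\rho\|_\rho + C_{\theta,\kappa}\,T^{-\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}\log T\log\frac{4T}{\delta}, \tag{5.6}
\]

where $C_{\theta,\kappa} = 4\big(3C_\theta + \frac{16}{3}\sqrt{\widetilde{C}_\theta}\big)\kappa(1+\kappa)^2 M$.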
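In addition, by (4.17), we have

\[
\|\omega_2^{T}(L_K)\tilde{f}_\rho\|_\rho \le \mathcal{K}\Big(\sqrt{2}(1+\kappa)\Big(\sum_{j=2}^{T}\gamma_j\Big)^{-1/2}, \tilde{f}_\rho\Big).
\]

Notice that

\[
\sum_{j=2}^{T}\gamma_j = \frac{1}{\mu}\sum_{j=2}^{T}j^{-\theta} \ge \frac{(T+1)^{1-\theta} - 2^{1-\theta}}{\mu(1-\theta)} \ge \frac{\big(1-(\frac{2}{3})^{1-\theta}\big)(T+1)^{1-\theta}}{\mu(1-\theta)} \ge \frac{T^{1-\theta}}{3\mu}.
\]

Consequently,

\[
\|\omega_2^{T}(L_K)\tilde{f}_\rho\|_\rho \le \mathcal{K}\big(\sqrt{6\mu}(1+\kappa)T^{-\frac{1-\theta}{2}}, \tilde{f}_\rho\big). \tag{5.7}
\]

Putting this back into (5.6) implies the desired result. This completes the proof of the theorem. □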


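Proof of Corollary 1. By the definition of almost sure convergence, it suffices to prove, for any $\varepsilon > 0$, that

\[
\lim_{t_0\to\infty} P\Big(\sup_{t\ge t_0}\big[\|f_{t+1} - \tilde{f}_\rho\|_\rho - \inf_{f\in\mathcal{H}_K}\|f - \tilde{f}_\rho\|_\rho\big] > 2\varepsilon\Big) = 0.
\]

However, it is well-known that $\lim_{s\to 0+}\mathcal{K}(s, \tilde{f}_\rho) = \inf_{f\in\mathcal{H}_K}\|f - \tilde{f}_\rho\|_\rho$ (see e.g. Lemma 9 in [40]). This means that there exists $t_1 \in \mathbb{N}$ such that, for any $t \ge t_1$, there holds

\[
\mathcal{K}\big(\sqrt{6\mu}(1+\kappa)t^{-\frac{1-\theta}{2}}, \tilde{f}_\rho\big) - \inf_{f\in\mathcal{H}_K}\|f - \tilde{f}_\rho\|_\rho \le \varepsilon.
\]

Let $\mathcal{R}_t = \|f_{t+1} - \tilde{f}_\rho\|_\rho - \mathcal{K}\big(\sqrt{6\mu}(1+\kappa)t^{-\frac{1-\theta}{2}}, \tilde{f}_\rho\big)$. The above estimation implies, for any $t_0 \ge t_1$, that

\[
P\Big(\sup_{t\ge t_0}\big[\|f_{t+1} - \tilde{f}_\rho\|_\rho - \inf_{f\in\mathcal{H}_K}\|f - \tilde{f}_\rho\|_\rho\big] > 2\varepsilon\Big) \le P\Big(\sup_{t\ge t_0}\mathcal{R}_t > \varepsilon\Big) \le \sum_{t=t_0}^{\infty}P(\mathcal{R}_t > \varepsilon). \tag{5.8}
\]

From Theorem 1, we have, for any $0 < \delta < 1$, that

\[
P\Big(\mathcal{R}_t > C_{\theta,\kappa}\,t^{-\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}\log t\log(4t/\delta)\Big) \le \delta,
\]

which is equivalent to

\[
P(\mathcal{R}_t > \varepsilon) \le 4t\exp\Big(-\frac{\varepsilon\,t^{\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}}{C_{\theta,\kappa}\log t}\Big).
\]

Putting this back into (5.8) implies that

\[
P\Big(\sup_{t\ge t_0}\big[\|f_{t+1} - \tilde{f}_\rho\|_\rho - \inf_{f\in\mathcal{H}_K}\|f - \tilde{f}_\rho\|_\rho\big] > 2\varepsilon\Big) \le \sum_{t=t_0}^{\infty}4t\exp\Big(-\frac{\varepsilon\,t^{\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}}{C_{\theta,\kappa}\log t}\Big). \tag{5.9}
\]

For any $1/2 < \theta < 1$ and $\varepsilon > 0$, it is easy to see that

\[
\sum_{t=2}^{\infty}4t\exp\Big(-\frac{\varepsilon\,t^{\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}}{C_{\theta,\kappa}\log t}\Big) < \infty.
\]

Consequently,

\[
\lim_{t_0\to\infty}\sum_{t=t_0}^{\infty}4t\exp\Big(-\frac{\varepsilon\,t^{\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}}{C_{\theta,\kappa}\log t}\Big) = 0.
\]

Combining this with (5.9) implies, for any $\varepsilon > 0$, that

\[
\lim_{t_0\to\infty} P\Big(\sup_{t\ge t_0}\big[\|f_{t+1} - \tilde{f}_\rho\|_\rho - \inf_{f\in\mathcal{H}_K}\|f - \tilde{f}_\rho\|_\rho\big] > 2\varepsilon\Big) = 0.
\]

This completes the proof of the corollary. □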


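From Theorem 1 and the estimation (4.18) for the approximation error, we can derive the explicit error rates for OPERA stated in Theorem 2.

Proof of Theorem 2. Applying (4.18) with $\beta > 0$ and (6.1) with $\gamma_j = \frac{j^{-\theta}}{\mu}$, $j = 2$ and $k = T$, we have that

\[
\begin{aligned}
\big\|\omega_2^{T}(L_K)L_K^{\beta}\big\|_{\mathcal{L}(L^2_\rho)} &\le \Big(\big(\tfrac{\beta}{e}\big)^{\beta} + \kappa^{2\beta}\Big)\Big(\sum_{j=2}^{T}\gamma_j\Big)^{-\beta} \\
&\le \Big(\big(\tfrac{\beta}{e}\big)^{\beta} + \kappa^{2\beta}\Big)\Big(\frac{(T+1)^{1-\theta} - 2^{1-\theta}}{\mu(1-\theta)}\Big)^{-\beta} \\
&\le \Big[\Big(\big(\tfrac{\beta}{e}\big)^{\beta} + \kappa^{2\beta}\Big)\big(\mu(1-\theta)\big)^{\beta}\Big(1 - \big(\tfrac{2}{3}\big)^{1-\theta}\Big)^{-\beta}\Big]T^{-\beta(1-\theta)} := D_{\kappa,\beta}\,T^{-\beta(1-\theta)}.
\end{aligned}
\]

Putting this estimation into Theorem 1 yields, with probability $1-\delta$, that

\[
\|f_{T+1} - \tilde{f}_\rho\|_\rho \le D_{\kappa,\beta}\,T^{-\beta(1-\theta)} + C_{\theta,\kappa}\,T^{-\min\{\theta-\frac{1}{2},\,\frac{1-\theta}{2}\}}\log T\log(8T/\delta). \tag{5.10}
\]

Selecting $\theta = \min\{\frac{2\beta+1}{2\beta+2}, \frac{2}{3}\}$ implies, with probability $1-\delta$, that

\[
\|f_{T+1} - \tilde{f}_\rho\|_\rho \le (D_{\kappa,\beta} + C_{\theta,\kappa})\,T^{-\min\{\frac{\beta}{2\beta+2},\,\frac{1}{6}\}}\log T\log(8T/\delta).
\]

This completes the proof of the theorem. □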
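
The choice of $\theta$ in the proof of Theorem 2 balances the exponent $\beta(1-\theta)$ of the approximation term against the exponent $\min\{\theta-\frac12, \frac{1-\theta}{2}\}$ of the sample-error term in (5.10). The following sketch works through this arithmetic with exact rational numbers. It assumes the selection $\theta = \min\{\frac{2\beta+1}{2\beta+2}, \frac23\}$ as read from the proof above, and the function names are illustrative only.

```python
from fractions import Fraction


def approximation_exponent(beta: Fraction, theta: Fraction) -> Fraction:
    """Exponent of T in the approximation term D * T^(-beta(1 - theta))."""
    return beta * (1 - theta)


def sample_exponent(theta: Fraction) -> Fraction:
    """Exponent of T in the sample-error term, min{theta - 1/2, (1 - theta)/2}."""
    return min(theta - Fraction(1, 2), (1 - theta) / 2)


def balanced_theta(beta: Fraction) -> Fraction:
    """Step-size exponent theta = min{(2 beta + 1)/(2 beta + 2), 2/3}."""
    return min((2 * beta + 1) / (2 * beta + 2), Fraction(2, 3))


def rate_exponent(beta: Fraction) -> Fraction:
    """Overall exponent: the smaller of the two exponents at the balanced theta."""
    theta = balanced_theta(beta)
    return min(approximation_exponent(beta, theta), sample_exponent(theta))


if __name__ == "__main__":
    for beta in (Fraction(1, 4), Fraction(1, 2), Fraction(2)):
        print(beta, balanced_theta(beta), rate_exponent(beta))
```

For $\beta \le \frac12$ the two exponents coincide at $\frac{\beta}{2\beta+2}$; for larger $\beta$ the sample-error exponent $\frac16$ becomes the bottleneck.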

We now turn our attention to the special pairwise kernel (2.8) induced by a univariate kernel $G$. Let us first prove Proposition 1 which describes the relationship between the space $\mathcal{H}_K$ with the pairwise kernel $K$ and $\mathcal{H}_G$ with the univariate kernel $G$.


Proof of Proposition 1. To prove (a), for any $n \in \mathbb{N}$, $\{\alpha_i \in \mathbb{R} : i = 1, \ldots, n\}$ and $\{(x_i^1, x_i^2) \in X \times X : i = 1, \ldots, n\}$, let $g = \sum_{i=1}^{n}\alpha_i(G_{x_i^1} - G_{x_i^2}) \in \mathcal{H}_G$. Indeed, it can be further verified that $g \in \mathcal{I}_G^{\perp}$ since $\langle g, 1_X\rangle_G = \langle\sum_{i=1}^{n}\alpha_i(G_{x_i^1} - G_{x_i^2}), 1_X\rangle_G = \sum_{i=1}^{n}\alpha_i(1_X(x_i^1) - 1_X(x_i^2)) = 0$. Then, for any $x^1, x^2 \in X$,

\[
\sum_{i=1}^{n}\alpha_i K_{(x_i^1, x_i^2)}(x^1, x^2) = \sum_{i=1}^{n}\alpha_i\big(G_{x_i^1}(x^1) - G_{x_i^2}(x^1)\big) - \sum_{i=1}^{n}\alpha_i\big(G_{x_i^1}(x^2) - G_{x_i^2}(x^2)\big) = g(x^1) - g(x^2) = \vartheta(g)(x^1, x^2).
\]

From the observation that $K((x^1, x^2), (x^1, x^2)) = \langle G_{x^1} - G_{x^2}, G_{x^1} - G_{x^2}\rangle_G$, we also see that

\[
\|\vartheta(g)\|_K = \Big\|\sum_{i=1}^{n}\alpha_i K_{(x_i^1, x_i^2)}\Big\|_K = \Big\|\sum_{i=1}^{n}\alpha_i(G_{x_i^1} - G_{x_i^2})\Big\|_G = \|g\|_G.
\]

According to [2], the RKHS $\mathcal{H}_K$ is the completion of the above linear span of kernel sections $\{K_{(x_i^1, x_i^2)} : x_i^1, x_i^2 \in X, i = 1, \ldots, n\}$ and likewise, $\mathcal{I}_G^{\perp}$ is the completion of the linear span of $\{G_{x_i^1} - G_{x_i^2} : x_i^1, x_i^2 \in X, i = 1, \ldots, n\}$, which implies that, for any $f \in \mathcal{H}_K$, there exists $g \in \mathcal{I}_G^{\perp}$ such that $f(x^1, x^2) = g(x^1) - g(x^2)$, and $\|f\|_K = \|g\|_G$. It remains to prove that if $\vartheta(g) = 0$ then $g \in \mathcal{I}_G$. Indeed, $\vartheta(g)(x^1, x^2) = 0$ implies that $g(x^1) = g(x^2)$ for any $x^1, x^2 \in X$. This means that $g$ is a constant function, that is, $g \in \mathcal{I}_G$. This completes part (a) of the proposition.

Part (b) follows from part (a) since, in this case, $\mathcal{I}_G = \{0\}$ which implies $\mathcal{H}_G = \mathcal{I}_G^{\perp}$. This completes the proof of the proposition. □

Secondly, for the special pairwise kernel given by (2.8), we can establish the convergence of online pairwise learning algorithm (2.9) as stated in Theorem 3.

Proof of Theorem 3. Part (a) directly follows from Theorem 1, Proposition 1 and the definition of $\mathcal{K}_G$ given by (2.11).

For part (b), under the assumption $1_X \notin \mathcal{H}_G$, from Proposition 1 we have

\[
\mathcal{K}_G(s, \tilde{f}_\rho) \le 2\inf_{g\in\mathcal{H}_G}\Big\{\|g - f_\rho\|_\rho + \frac{s}{2}\|g\|_G\Big\} \le 2\sqrt{2}\Big(\inf_{g\in\mathcal{H}_G}\Big\{\|g - f_\rho\|_\rho^2 + \frac{s^2}{4}\|g\|_G^2\Big\}\Big)^{1/2}. \tag{5.11}
\]

According to [12, 15], $\inf_{g\in\mathcal{H}_G}\{\|g - f_\rho\|_\rho^2 + \lambda\|g\|_G^2\} \le \lambda^{2\beta}\|L_G^{-\beta}f_\rho\|_\rho^2$ for any $\beta \le 1/2$. Now applying this estimation and (5.11) with $\lambda = \frac{s^2}{4}$ and $s = \sqrt{2}(1+\kappa)\big(\sum_{j=2}^{T}\gamma_j\big)^{-1/2}$ implies that

\[
\mathcal{K}_G\big(\sqrt{6\mu}(1+\kappa)T^{-\frac{1-\theta}{2}}, \tilde{f}_\rho\big) \le O\big(T^{-\beta(1-\theta)}\big).
\]

Putting this into (2.6) and choosing $\theta = \frac{2\beta+1}{2\beta+2}$ yields the desired result. This completes the proof of the theorem. □


6 Conclusion 

This paper studied OPERA, an online algorithm for pairwise learning in an unconstrained RKHS setting. OPERA has a non-strongly convex objective function and is performed in an unconstrained setting; we are not aware of similar studies for such online pairwise learning algorithms. We established its almost sure convergence and derived explicit error rates for polynomially decaying step sizes. Below we discuss some possible directions for future work.

Firstly, the rates of OPERA under the regularity assumption $\tilde{f}_\rho \in L_K^{\beta}(L^2_\rho)$ are of the form $\mathbb{E}[\|f_{T+1} - \tilde{f}_\rho\|_\rho] \le \widetilde{O}(T^{-\frac{\beta}{2\beta+2}})$, which is suboptimal compared with the rate $O(T^{-\frac{\beta}{2\beta+1}})$ in the univariate case [40]. It would be very interesting to improve the rates of OPERA. Secondly, OPERA is not a fully online learning algorithm since it needs to save previous samples $z^t = \{(x_i, y_i) : i = 1, \ldots, t\}$ at iteration $t$, although, in the linear case, efficient implementation may be possible. Hence, to improve the efficiency of OPERA, another direction would be to introduce a memory-efficient implementation which uses only a bounded subset of the past training samples as in [18, 36]. Finally, we only considered the least-square loss for pairwise learning. It is particularly interesting and challenging to establish similar results for general loss functions.
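
To illustrate the remark about the linear case, the sketch below compares a naive update that loops over all stored samples with an update that only keeps running sums. It is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a linear univariate kernel $G(x, u) = x \cdot u$, so that the hypothesis is $f(x, x') = w \cdot (x - x')$, and it assumes the update takes the form $w_{t+1} = w_t - \frac{\gamma_t}{t-1}\sum_{j<t}\big(w_t \cdot (x_t - x_j) - (y_t - y_j)\big)(x_t - x_j)$. All names (`naive_step`, `RunningStats`, `run`) and the synthetic data are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def naive_step(w, xs, ys, x_t, y_t, gamma):
    """One update with an explicit loop over all stored samples (cost grows with t)."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x_j, y_j in zip(xs, ys):
        d = [a - b for a, b in zip(x_t, x_j)]
        resid = dot(w, d) - (y_t - y_j)
        for k in range(len(w)):
            grad[k] += resid * d[k]
    return [wk - gamma * gk / n for wk, gk in zip(w, grad)]


class RunningStats:
    """Running sums of x, y, y*x and x x^T over past samples (memory O(dim^2))."""

    def __init__(self, dim):
        self.n = 0
        self.sx = [0.0] * dim
        self.sy = 0.0
        self.sxy = [0.0] * dim
        self.sxx = [[0.0] * dim for _ in range(dim)]

    def add(self, x, y):
        self.n += 1
        self.sy += y
        for k, xk in enumerate(x):
            self.sx[k] += xk
            self.sxy[k] += y * xk
            for m, xm in enumerate(x):
                self.sxx[k][m] += xk * xm

    def step(self, w, x_t, y_t, gamma):
        """Same update as naive_step, expanded in terms of the running sums."""
        n = self.n
        xw, sw = dot(x_t, w), dot(self.sx, w)
        new_w = []
        for k in range(len(w)):
            quad = (n * x_t[k] * xw - x_t[k] * sw
                    - self.sx[k] * xw + dot(self.sxx[k], w))
            lin = (n * y_t * x_t[k] - y_t * self.sx[k]
                   - self.sy * x_t[k] + self.sxy[k])
            new_w.append(w[k] - gamma * (quad - lin) / n)
        return new_w


def run(T=40, dim=2, theta=0.75, mu=8.0, seed=0):
    """Run both variants on synthetic data and return the two weight vectors."""
    rng = random.Random(seed)
    w_true = [1.0 / (k + 1) for k in range(dim)]  # synthetic target (assumption)
    xs, ys, stats = [], [], RunningStats(dim)
    w_naive, w_fast = [0.0] * dim, [0.0] * dim
    for t in range(1, T + 1):
        x_t = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        y_t = dot(w_true, x_t) + rng.gauss(0.0, 0.1)
        if t >= 2:  # the first update needs at least one stored sample
            gamma = t ** (-theta) / mu
            w_naive = naive_step(w_naive, xs, ys, x_t, y_t, gamma)
            w_fast = stats.step(w_fast, x_t, y_t, gamma)
        xs.append(x_t)
        ys.append(y_t)
        stats.add(x_t, y_t)
    return w_naive, w_fast


if __name__ == "__main__":
    print(run())
```

With `dim=2` and inputs in $[-1,1]^2$ the pairwise kernel satisfies $K((x,x'),(x,x')) = \|x - x'\|^2 \le 8$, so the default `mu=8.0` keeps $\gamma_t\kappa^2 \le 1$. The two variants agree up to rounding, while the running-sum variant needs memory independent of $t$.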
Appendix

Here we present the proofs for Lemma 6 and Lemma 7. To this end, we first state a technical lemma which will be used later.

Lemma 8. Let $\gamma_j = \frac{j^{-\theta}}{\mu}$ for any $j \in \mathbb{N}$ with $\theta \in (0, 1)$. Then, for any $1 \le j \le k$, there holds

\[
\frac{1}{\mu(1-\theta)}\big((k+1)^{1-\theta} - j^{1-\theta}\big) \le \sum_{\ell=j}^{k}\gamma_\ell \le \frac{1}{\mu(1-\theta)}\big(k^{1-\theta} - (j-1)^{1-\theta}\big). \tag{6.1}
\]

Proof. Notice that $\ell^{-\theta} \le s^{-\theta}$ for $s \in [\ell-1, \ell]$ and $\ell^{-\theta} \ge s^{-\theta}$ for $s \in [\ell, \ell+1]$. Hence, $\frac{1}{\mu}\sum_{\ell=j}^{k}\int_{\ell}^{\ell+1}s^{-\theta}\,ds \le \sum_{\ell=j}^{k}\gamma_\ell \le \frac{1}{\mu}\sum_{\ell=j}^{k}\int_{\ell-1}^{\ell}s^{-\theta}\,ds$ which implies that

\[
\frac{1}{\mu}\int_{j}^{k+1}s^{-\theta}\,ds \le \sum_{\ell=j}^{k}\gamma_\ell \le \frac{1}{\mu}\int_{j-1}^{k}s^{-\theta}\,ds.
\]

The desired result follows directly from the above inequality. □
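
The two-sided bound (6.1) is easy to check numerically. The sketch below evaluates the lower bound, the partial sum, and the upper bound for given $j$, $k$, $\theta$ and $\mu$; the function name and the sample parameters are illustrative choices.

```python
def lemma8_bounds(j: int, k: int, theta: float, mu: float):
    """Return (lower, partial_sum, upper) for sum_{l=j}^{k} l^(-theta)/mu as in (6.1)."""
    partial_sum = sum(l ** (-theta) / mu for l in range(j, k + 1))
    c = 1.0 / (mu * (1.0 - theta))
    lower = c * ((k + 1) ** (1.0 - theta) - j ** (1.0 - theta))
    upper = c * (k ** (1.0 - theta) - (j - 1) ** (1.0 - theta))
    return lower, partial_sum, upper


if __name__ == "__main__":
    for j, k in ((1, 1), (2, 10), (3, 1000)):
        print(j, k, lemma8_bounds(j, k, theta=0.75, mu=2.0))
```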

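We are ready to establish the proof of Lemma 6.

Proof of Lemma 6. Let $J := \sum_{j=2}^{t}\frac{\gamma_j(1+(\sum_{\ell=2}^{j-1}\gamma_\ell)^{1/2})}{\sqrt{j}\,(1+\sum_{\ell=j+1}^{t}\gamma_\ell)^{1/2}}$. It can be written as

\[
J = \Big[\frac{\gamma_t\big(1+(\sum_{\ell=2}^{t-1}\gamma_\ell)^{1/2}\big)}{\sqrt{t}}\Big] + \Big[\frac{\gamma_2}{\sqrt{2}\,\big(1+\sum_{\ell=3}^{t}\gamma_\ell\big)^{1/2}}\Big] + \sum_{j=3}^{t-1}\frac{\gamma_j\big(1+(\sum_{\ell=2}^{j-1}\gamma_\ell)^{1/2}\big)}{\sqrt{j}\,\big(1+\sum_{\ell=j+1}^{t}\gamma_\ell\big)^{1/2}} := J_1 + J_2 + J_3. \tag{6.2}
\]

We estimate $J_1$, $J_2$, and $J_3$ separately as follows.

Firstly, let us look at the term $J_1$. Indeed, by (6.1) we have

\[
J_1 = \frac{1}{\mu}t^{-\theta-1/2}\Big(1+\Big(\sum_{\ell=2}^{t-1}\gamma_\ell\Big)^{1/2}\Big) \le \frac{1}{\mu}t^{-\theta-1/2}\Big(1+\Big(\frac{t^{1-\theta}}{\mu(1-\theta)}\Big)^{1/2}\Big) \le \frac{2}{\mu}\max\big(1, (\sqrt{\mu(1-\theta)})^{-1}\big)\,t^{-\frac{3\theta}{2}}.
\]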

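Secondly, for the term $J_2$, we apply (6.1) again to get that

\[
J_2 \le \frac{2^{-\theta-1/2}}{\mu\big(1+\frac{(t+1)^{1-\theta} - 3^{1-\theta}}{\mu(1-\theta)}\big)^{1/2}} \le \frac{1}{\mu}\Big(\frac{\mu(1-\theta)}{\big(1-(\frac{3}{5})^{1-\theta}\big)(t+1)^{1-\theta}}\Big)^{1/2} \le \Big(\frac{5}{2\mu}\Big)^{1/2}t^{-\frac{1-\theta}{2}}, \tag{6.4}
\]

where the second to last inequality used the assumption $t \ge 4$ which implies $3^{1-\theta} \le (\frac{3}{5}(t+1))^{1-\theta}$, and the last inequality used the property that, for any $0 < \theta < 1$ and $0 < x < 1$, $1 - x^{1-\theta} \ge (1-\theta)(1-x)$.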


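Lastly, we estimate the term $J_3$. To this end, by (6.1) we can estimate it as follows:

\[
\begin{aligned}
J_3 &\le \frac{2}{\mu}\max\big(1, (\sqrt{\mu(1-\theta)})^{-1}\big)\sum_{j=3}^{t-1}\frac{j^{-\frac{3\theta}{2}}}{\big(1+\frac{1}{\mu(1-\theta)}((t+1)^{1-\theta} - (j+1)^{1-\theta})\big)^{1/2}} \\
&\le \frac{2}{\mu}\max\big((\sqrt{\mu(1-\theta)})^{-1}, \sqrt{\mu(1-\theta)}\big)\sum_{j=3}^{t-1}\frac{j^{-\frac{3\theta}{2}}}{\big(1+((t+1)^{1-\theta} - (j+1)^{1-\theta})\big)^{1/2}}.
\end{aligned} \tag{6.5}
\]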


t -1 


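It remains to estimate $\sum_{j=3}^{t-1}\frac{j^{-3\theta/2}}{(1+((t+1)^{1-\theta} - (j+1)^{1-\theta}))^{1/2}}$. To this end, we further decompose it into two terms as

\[
\sum_{j=3}^{t-1}\frac{j^{-\frac{3\theta}{2}}}{\big(1+((t+1)^{1-\theta} - (j+1)^{1-\theta})\big)^{1/2}} = \Big(\sum_{j>t/2} + \sum_{3\le j\le t/2}\Big)\frac{j^{-\frac{3\theta}{2}}}{\big(1+((t+1)^{1-\theta} - (j+1)^{1-\theta})\big)^{1/2}} := J_{31} + J_{32}. \tag{6.6}
\]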

For $J_{31}$, observe, for any $s \in [j, j+1]$, that $j^{-\theta} \le 2^{\theta}(1+s)^{-\theta}$ and $(t+1)^{1-\theta} - (j+1)^{1-\theta} \ge (t+1)^{1-\theta} - (s+1)^{1-\theta}$. Then,


J31 < 22^2 Ej>t /2 (l+((t+l)l-^( 7 +l)l-e))l /2 

< 2 30 / 2 rf E 5 >i /2 J 3+ (1 + s ^~ eds 


(l + (t + l) 1 - 0 -(s + l) 1 - 0 ) 1 /2 

(1 + s)~ e ds _ ( 6 . 7 ) 


< 2 3 «/ 2 H / _ 

Jt/2 (1 + (i + l) 1-e — (s + l ) 1 - 0 ) 1 / 2 

< [(i + (<+ 1 ) 1 -") - m +d 1 - 6 ] 172 
For $J_{32}$, the fact that $(t+1)^{1-\theta} - (j+1)^{1-\theta} \ge (1-(2/3)^{1-\theta})(t+1)^{1-\theta}$ for any $j \le t/2$ implies that


j, V) <_I_ v r 39/2 < 3 t~F- e )/ 2 V ? 

t^il — ( 2 / 3 Y~ 9 ) 1 — 6 ^ 

1 \ L 1 3<j<t/2 3<j<t/2 


-39/2 


( 6 . 8 ) 


Notice that 


ft/2 


E r? < r s -^ ds <\ if ^ 2 / 3 


3<j<f/2 - k lll( ' 

Putting the above inequality into (16.8p yields that 

J 32 <^r min{0 -^^inf, 


if 0 = 2/3 


(6.9) 


where A e = ( 1 _ g ^ 3 g_ 2 | if $ 7 ^ 2/3 and j-Lj otherwise. Combining (16.7j) and (16.9p . 


(16.5p . and (16.6H together implies that 

Js < 

where 

B e = 


( 6 . 10 ) 


4max(yV(l-0)) to AT , 3 \ ■{ Q _Z o /o 

- M1 _ g) - (2y 2 + p=2f J, li ^ T 2/3 

2(3+4^) max(^//i(l-0))- 1 ,^//i(l-6»)) ;fa _ ort 

/i(l-fi) ’ llU—A/ 6 . 

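Now putting estimates (6.3), (6.4), and (6.10) together yields the desired result. This completes the proof of the lemma. □

We now turn our attention to the proof for Lemma 7.

Proof of Lemma 7. Let $\mathcal{I} = \sum_{j=2}^{t}\frac{\gamma_j^2(1+\sum_{\ell=2}^{j-1}\gamma_\ell)}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}$. We can write $\mathcal{I}$ as

\[
\mathcal{I} = \Big[\gamma_t^2\Big(1+\sum_{\ell=2}^{t-1}\gamma_\ell\Big)\Big] + \Big[\frac{\gamma_2^2}{1+\sum_{\ell=3}^{t}\gamma_\ell}\Big] + \sum_{j=3}^{t-1}\frac{\gamma_j^2\big(1+\sum_{\ell=2}^{j-1}\gamma_\ell\big)}{1+\sum_{\ell=j+1}^{t}\gamma_\ell} := \mathcal{I}_1 + \mathcal{I}_2 + \mathcal{I}_3, \tag{6.11}
\]

where we used the conventional notation $\sum_{\ell=j+1}^{j}\gamma_\ell = 0$ for any $j \in \mathbb{N}$. We estimate $\mathcal{I}_1$, $\mathcal{I}_2$, and $\mathcal{I}_3$ term by term as follows.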


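Firstly, let us estimate $\mathcal{I}_1$. By (6.1), we have that

\[
\mathcal{I}_1 = \frac{1}{\mu^2}t^{-2\theta}\Big(1+\sum_{\ell=2}^{t-1}\gamma_\ell\Big) \le \frac{1}{\mu^2}t^{-2\theta}\Big(1+\frac{t^{1-\theta}}{\mu(1-\theta)}\Big) \le \frac{2}{\mu^2}\max\big(1, (\mu(1-\theta))^{-1}\big)\,t^{1-3\theta}. \tag{6.12}
\]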

Secondly, we move on to the estimation of the term $\mathcal{I}_2$. By (6.1), we obtain that
where, in the second to last inequality, we used the assumption $t \ge 4$ which implies $3^{1-\theta} \le (\frac{3}{5}(t+1))^{1-\theta}$, and the last inequality used the fact, for any $0 < \theta < 1$ and $0 < x < 1$, that $1 - x^{1-\theta} \ge (1-\theta)(1-x)$.

Finally, we turn our attention to the estimation of $\mathcal{I}_3$. Applying (6.1) again to $\mathcal{I}_3$ implies that
It now suffices to estimate the remaining sum over $3 \le j \le t-1$, which is split into the part $\mathcal{I}_{31}$ with $j > t/2$ and the part $\mathcal{I}_{32}$ with $3 \le j \le t/2$.
For the first term $\mathcal{I}_{31}$, observe, for any $s \in [j, j+1]$, that $j^{-\theta} \le 2^{\theta}(1+s)^{-\theta}$ and $(t+1)^{1-\theta} - (j+1)^{1-\theta} \ge (t+1)^{1-\theta} - (s+1)^{1-\theta}$. Therefore,
For $\mathcal{I}_{32}$, we have
where we used again the fact, for any $0 < \theta < 1$ and $0 < x < 1$, that $1 - x^{1-\theta} \ge (1-\theta)(1-x)$. Also, by a simple calculation, there holds

\[
\sum_{3\le j\le t/2}j^{1-3\theta} \le \begin{cases}\frac{1}{|3\theta-2|}\,t^{-\min(0,\,3\theta-2)}, & \text{if } \theta \ne 2/3,\\[2pt] \ln t, & \text{if } \theta = 2/3.\end{cases}
\]

Putting the above estimation into (6.17) yields that

\[
\mathcal{I}_{32} \le A_\theta\,t^{-\min(2\theta-1,\,1-\theta)}\ln t, \tag{6.18}
\]

where $A_\theta = \frac{3}{|3\theta-2|(1-\theta)}$ if $\theta \ne 2/3$ and $\frac{3}{1-\theta}$ otherwise. Putting (6.16) and (6.18) back into (6.14) implies that

\[
\mathcal{I}_3 \le B_\theta\,t^{-\min(2\theta-1,\,1-\theta)}\ln t, \tag{6.19}
\]

where $B_\theta = \frac{3}{(1-\theta)|3\theta-2|} + 4$ if $\theta \ne 2/3$ and $\frac{3}{1-\theta} + 4$ otherwise. Combining estimates (6.12), (6.13), and (6.19) together yields the desired result. This completes the proof of the lemma. □